
EE 641 - Unit 2
Fall 2025
Backpropagation Through Convolutions
[Architecture] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[Architecture] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
[Detection] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[Detection] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[Detection] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[Detection] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[Detection] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023.
[Pose] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2019.
[Segmentation] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[Segmentation] A. Kirillov et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
For layer \(\ell\) with pre-activation \(z^\ell = W^\ell a^{\ell-1} + b^\ell\):
\[\delta^\ell = \frac{\partial \mathcal{L}}{\partial z^\ell} = (W^{\ell+1})^T \delta^{\ell+1} \odot \sigma'(z^\ell)\]
Weight gradient: \[\frac{\partial \mathcal{L}}{\partial W^\ell_{ij}} = \delta^\ell_i \cdot a^{\ell-1}_j\]
Matrix form: \[\frac{\partial \mathcal{L}}{\partial W^\ell} = \delta^\ell (a^{\ell-1})^T\]
The transpose operation \((W^{\ell+1})^T\) reverses the linear transformation during backpropagation.
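These rules can be checked numerically. A minimal NumPy sketch — the toy layer sizes and the quadratic surrogate loss \(\mathcal{L} = \tfrac{1}{2}\|z^2\|^2\) are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 3 -> 4 -> 2, with sigma = ReLU
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
x = rng.standard_normal(3)

z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)
z2 = W2 @ a1 + b2

# Surrogate loss L = 0.5 * ||z2||^2, so delta2 = dL/dz2 = z2
delta2 = z2.copy()
# delta^l = (W^{l+1})^T delta^{l+1} elementwise sigma'(z^l); ReLU' = [z > 0]
delta1 = (W2.T @ delta2) * (z1 > 0)
dW2 = np.outer(delta2, a1)   # dL/dW^l = delta^l (a^{l-1})^T
dW1 = np.outer(delta1, x)

# Finite-difference check of one entry of dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
z2p = W2 @ np.maximum(W1p @ x + b1, 0) + b2
fd = 0.5 * (np.sum(z2p**2) - np.sum(z2**2)) / eps
```

The finite-difference quotient `fd` should agree with the analytic entry `dW1[0, 0]`.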

For 2D discrete signals \(x\) and kernel \(h\):
\[(x * h)[i,j] = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} h[m,n] \cdot x[i+m, j+n]\]
where \(K\) is the kernel dimension.
CNNs implement cross-correlation (not convolution):
The distinction is immaterial in practice: the kernels are learned, so a flipped kernel is an equally valid solution.

One output element requires \(K^2\) multiply-accumulate operations
Given loss gradient \(\frac{\partial \mathcal{L}}{\partial y[i,j]}\), we require \(\frac{\partial \mathcal{L}}{\partial x[p,q]}\).
Element \(x[p,q]\) contributes to output positions: \[\{y[i,j] : p-K+1 \leq i \leq p, \; q-K+1 \leq j \leq q\}\]
By chain rule: \[\frac{\partial \mathcal{L}}{\partial x[p,q]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y[i,j]} \cdot \frac{\partial y[i,j]}{\partial x[p,q]}\]
Since \(y[i,j] = \sum_{m,n} h[m,n] \cdot x[i+m, j+n]\):
\[\frac{\partial y[i,j]}{\partial x[p,q]} = \begin{cases} h[p-i, q-j] & \text{if } 0 \leq p-i < K, \; 0 \leq q-j < K \\ 0 & \text{otherwise} \end{cases}\]

\(x[3,3]\) influences a \(3 \times 3\) region in output (for \(K=3\))
Substituting the partial derivative:
\[\frac{\partial \mathcal{L}}{\partial x[p,q]} = \sum_{i=p-K+1}^{p} \sum_{j=q-K+1}^{q} \frac{\partial \mathcal{L}}{\partial y[i,j]} \cdot h[p-i, q-j]\]
Change of variables: \(m = p-i\), \(n = q-j\):
\[\frac{\partial \mathcal{L}}{\partial x[p,q]} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} h[m,n] \cdot \frac{\partial \mathcal{L}}{\partial y[p-m, q-n]}\]
This is convolution with a rotated kernel:
\[\boxed{\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} * h^{\text{rot180}}}\]
where \(h^{\text{rot180}}[m,n] = h[K-1-m, K-1-n]\)

180° rotation: \(h[m,n] \rightarrow h[K-1-m, K-1-n]\)
To recover original dimensions: \[\text{pad} = K - 1\]
This yields “full” convolution in backward pass.
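The boxed rule is easy to verify against a brute-force application of the chain rule; a NumPy sketch with illustrative sizes:

```python
import numpy as np

def corr2d(x, h):
    """'Valid' cross-correlation, the forward op: y[i,j] = sum_{m,n} h[m,n] x[i+m,j+n]."""
    K = h.shape[0]
    H = x.shape[0] - K + 1
    return np.array([[np.sum(h * x[i:i+K, j:j+K]) for j in range(H)]
                     for i in range(H)])

rng = np.random.default_rng(0)
K = 3
x = rng.standard_normal((6, 6))
h = rng.standard_normal((K, K))
y = corr2d(x, h)
dy = rng.standard_normal(y.shape)     # upstream gradient dL/dy

# Boxed rule: pad dy by K-1 ("full" mode) and correlate with the 180-rotated kernel
dx = corr2d(np.pad(dy, K - 1), np.rot90(h, 2))

# Brute-force chain rule: scatter each dy[i,j] through the window it touched
dx_ref = np.zeros_like(x)
for i in range(y.shape[0]):
    for j in range(y.shape[1]):
        dx_ref[i:i+K, j:j+K] += dy[i, j] * h
```

The two results match exactly, confirming that the "full" correlation with \(h^{\text{rot180}}\) implements the sum derived above.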

Each kernel element \(h[m,n]\) appears in computing all output positions:
\[\frac{\partial \mathcal{L}}{\partial h[m,n]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y[i,j]} \cdot \frac{\partial y[i,j]}{\partial h[m,n]}\]
From \(y[i,j] = \sum_{m',n'} h[m',n'] \cdot x[i+m', j+n']\):
\[\frac{\partial y[i,j]}{\partial h[m,n]} = x[i+m, j+n]\]
Therefore: \[\boxed{\frac{\partial \mathcal{L}}{\partial h[m,n]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y[i,j]} \cdot x[i+m, j+n]}\]
This is the correlation between input and output gradient.

Each weight gradient accumulates contributions from all spatial positions where that weight was applied.
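A NumPy sketch of the weight-gradient rule, checked against a linear surrogate loss \(\mathcal{L} = \sum_{i,j} g[i,j]\,y[i,j]\) (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
x = rng.standard_normal((5, 5))
h = rng.standard_normal((K, K))
Hout = x.shape[0] - K + 1

# Forward: y[i,j] = sum_{m,n} h[m,n] x[i+m, j+n]
y = np.array([[np.sum(h * x[i:i+K, j:j+K]) for j in range(Hout)]
              for i in range(Hout)])
dy = rng.standard_normal(y.shape)

# Boxed rule: dL/dh[m,n] = sum_{i,j} dy[i,j] x[i+m, j+n]
# i.e. a 'valid' cross-correlation of the input with the output gradient
dh = np.array([[np.sum(dy * x[m:m+Hout, n:n+Hout]) for n in range(K)]
               for m in range(K)])

# Check one entry: for the linear loss L = sum(dy * y) the
# finite-difference quotient is exact up to float error
eps = 1e-6
hp = h.copy(); hp[1, 2] += eps
yp = np.array([[np.sum(hp * x[i:i+K, j:j+K]) for j in range(Hout)]
               for i in range(Hout)])
fd = np.sum(dy * (yp - y)) / eps
```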
Output dimension: \[H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} - K}{s} \right\rfloor + 1\]
Convolution with stride \(s\): \[y[i,j] = \sum_{m,n} h[m,n] \cdot x[si+m, sj+n]\]
Insert \((s-1)\) zeros between gradient elements: \[\tilde{g}[i,j] = \begin{cases} \frac{\partial \mathcal{L}}{\partial y[i/s, j/s]} & \text{if } s|i \text{ and } s|j \\ 0 & \text{otherwise} \end{cases}\]
Then apply the standard stride-1 backward rule (full padding, convolution with the 180°-rotated kernel) to \(\tilde{g}\).
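A sketch of the dilate-then-convolve rule, checked against the brute-force chain rule (illustrative sizes):

```python
import numpy as np

def corr2d(x, h):
    """'Valid' cross-correlation."""
    K = h.shape[0]
    H = x.shape[0] - K + 1
    return np.array([[np.sum(h * x[i:i+K, j:j+K]) for j in range(H)]
                     for i in range(H)])

rng = np.random.default_rng(2)
s, K = 2, 3
x = rng.standard_normal((7, 7))
h = rng.standard_normal((K, K))
Hout = (x.shape[0] - K) // s + 1

# Strided forward: y[i,j] = sum_{m,n} h[m,n] x[s*i+m, s*j+n]
y = np.array([[np.sum(h * x[s*i:s*i+K, s*j:s*j+K]) for j in range(Hout)]
              for i in range(Hout)])
dy = rng.standard_normal(y.shape)

# Insert (s-1) zeros between gradient elements ...
g = np.zeros(((Hout - 1) * s + 1,) * 2)
g[::s, ::s] = dy
# ... then the stride-1 rule: full padding + 180-rotated kernel
dx = corr2d(np.pad(g, K - 1), np.rot90(h, 2))

# Brute-force chain rule for comparison
dx_ref = np.zeros_like(x)
for i in range(Hout):
    for j in range(Hout):
        dx_ref[s*i:s*i+K, s*j:s*j+K] += dy[i, j] * h
```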


For \(K \times K\) pooling window:
\[\text{Forward}: \quad y[i,j] = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x[Ki+m, Kj+n]\]
\[\text{Backward}: \quad \frac{\partial \mathcal{L}}{\partial x[p,q]} = \frac{1}{K^2} \cdot \frac{\partial \mathcal{L}}{\partial y[\lfloor p/K \rfloor, \lfloor q/K \rfloor]}\]
The gradient is uniformly distributed across the pooling window.
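Both passes in a few NumPy lines, for a non-overlapping \(2 \times 2\) window on an illustrative \(4 \times 4\) input:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 2
x = rng.standard_normal((4, 4))

# Forward: non-overlapping K x K average pooling (stride = K)
y = x.reshape(2, K, 2, K).mean(axis=(1, 3))
dy = rng.standard_normal(y.shape)

# Backward: each input pixel receives 1/K^2 of its window's output gradient
dx = np.repeat(np.repeat(dy, K, axis=0), K, axis=1) / K**2
```

The total gradient is conserved: each \(dy\) entry is split evenly over its \(K^2\) inputs.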
\[y[i,j,c_o] = \sum_{c_i=0}^{C_{\text{in}}-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} h[m,n,c_i,c_o] \cdot x[i+m,j+n,c_i]\]
Input gradient (sum over output channels): \[\frac{\partial \mathcal{L}}{\partial x[p,q,c_i]} = \sum_{c_o=0}^{C_{\text{out}}-1} \left( \frac{\partial \mathcal{L}}{\partial y[\cdot,\cdot,c_o]} * h^{\text{rot180}}[\cdot,\cdot,c_i,c_o] \right)[p,q]\]
Weight gradient (correlation structure): \[\frac{\partial \mathcal{L}}{\partial h[m,n,c_i,c_o]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial y[i,j,c_o]} \cdot x[i+m,j+n,c_i]\]

The im2col transformation linearizes the convolution operation:
# Forward pass
X_col = im2col(X, kernel_size=K, stride=s, padding=p)  # Shape: (K*K*C_in, H_out*W_out)
W_row = W.reshape(C_out, K*K*C_in)                     # Shape: (C_out, K*K*C_in)
Y_col = W_row @ X_col                                  # Shape: (C_out, H_out*W_out)
Y = Y_col.reshape(C_out, H_out, W_out)
# Backward pass
dY_col = dY.reshape(C_out, H_out*W_out)
dW = (dY_col @ X_col.T).reshape(W.shape)  # Weight gradient
dX_col = W_row.T @ dY_col                 # Input gradient (columnar)
dX = col2im(dX_col, shape=(H, W, C_in))   # Scatter-add back to spatial layout
This reformulation enables hardware acceleration through optimized BLAS libraries.
Standard convolution complexity: \[O(K^2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot H_{\text{out}} \cdot W_{\text{out}})\]
Example: Single ResNet-50 layer
Mobile constraint: <500M MACs total

Standard convolution combines two operations:
\[y[i,j,c_o] = \sum_{c_i} \sum_{m,n} h[m,n,c_i,c_o] \cdot x[i+m,j+n,c_i]\]
These operations can be separated for efficiency.
\[y[i,j,c_o] = \sum_{c_i} w[c_i,c_o] \cdot \left(\sum_{m,n} h'[m,n,c_i] \cdot x[i+m,j+n,c_i]\right)\]

Key Concept: Each input channel is convolved with its own kernel
Parameters: \(K^2 \times C_{in}\) (vs \(K^2 \times C_{in} \times C_{out}\) for standard)
Example:


Key Concept: Mix channels without spatial filtering
Parameters: \(C_{in} \times C_{out}\)
Use Cases:
Example:


Key Concept: Divide channels into groups
Parameters: \(\frac{K^2 \times C_{in} \times C_{out}}{g}\)
Trade-offs:
ShuffleNet Solution:
Apply K×K filter to each channel independently: \[\hat{x}[i,j,c] = \sum_{m,n} h_{\text{dw}}[m,n,c] \cdot x[i+m,j+n,c]\]
Parameters: \(K^2 \cdot C_{\text{in}}\)
Mix channels with 1×1 convolution: \[y[i,j,c_o] = \sum_{c_i} h_{\text{pw}}[c_i,c_o] \cdot \hat{x}[i,j,c_i]\]
Parameters: \(C_{\text{in}} \cdot C_{\text{out}}\)
\[\frac{\text{Depthwise Separable}}{\text{Standard}} = \frac{K^2 C + CC'}{K^2 CC'} = \frac{1}{C'} + \frac{1}{K^2}\]
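A quick arithmetic check of this ratio (the channel counts are illustrative, not from the slides):

```python
# Parameter counts; K and the channel sizes are illustrative choices
K, C_in, C_out = 3, 128, 256

standard = K**2 * C_in * C_out     # one K x K x C_in filter per output channel
depthwise = K**2 * C_in            # one K x K filter per input channel
pointwise = C_in * C_out           # 1x1 channel mixing
separable = depthwise + pointwise

ratio = separable / standard       # equals 1/C_out + 1/K^2
```

For \(K=3\) the spatial term alone bounds the ratio near \(1/9\), so depthwise separable layers cost roughly 8–9× fewer parameters and MACs.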


Total: 569M MACs, 4.2M parameters (vs VGG16: 15.3B MACs, 138M parameters)
Low-dimensional ReLU causes information loss:

Linear bottleneck: no activation after projection

Expansion factor \(t=6\) typical
Divide channels into \(g\) groups:
\[y[i,j,c_o] = \sum_{c_i \in \mathcal{G}(c_o)} \sum_{m,n} h[m,n,c_i,c_o] \cdot x[i+m,j+n,c_i]\]
where \(\mathcal{G}(c_o)\) is the input group for output \(c_o\).
# PyTorch implementation
conv_grouped = nn.Conv2d(
    in_channels=128,
    out_channels=128,
    kernel_size=3,
    groups=8  # 16 channels per group
)
# Parameters: 3×3×16×16×8 = 18,432
# Standard: 3×3×128×128 = 147,456
Reduction factor = \(g\) (here: 8×)

Groups don’t communicate → limited representation
Deterministic channel permutation:


ShuffleNet 1.5× matches MobileNet accuracy with 2× fewer operations
Given compound coefficient \(\phi\):
\[\begin{align} \text{depth}: \quad d &= \alpha^\phi \\ \text{width}: \quad w &= \beta^\phi \\ \text{resolution}: \quad r &= \gamma^\phi \end{align}\]
Constraint: \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\)
(FLOPs \(\propto\) depth × width² × resolution²)
For EfficientNet: \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\)
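A quick numeric check of the scaling rule with these coefficients:

```python
# EfficientNet compound scaling (alpha, beta, gamma from the slide)
alpha, beta, gamma = 1.2, 1.1, 1.15

# Constraint: alpha * beta^2 * gamma^2 ~ 2, so FLOPs roughly double per unit phi
flops_factor = alpha * beta**2 * gamma**2   # ~1.92

phi = 3
d, w, r = alpha**phi, beta**phi, gamma**phi   # depth, width, resolution multipliers
flops_growth = d * (w**2) * (r**2)            # = flops_factor**phi
```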

Discovered through hardware-aware NAS:
\[h\text{-}swish(x) = x \cdot \frac{\text{ReLU6}(x+3)}{6}\]
Approximates swish but cheaper:


ConvNeXt (2022): “Modernized” ResNet with depthwise conv, larger kernels (7×7), and transformer-inspired designs. Achieves 87.8% ImageNet with standard convolutions.
EfficientNetV2 (2021): Progressive training with Fused-MBConv blocks. Trains 5-11× faster than V1.

Memory bandwidth often dominates latency on edge devices, not arithmetic operations
For a network with \(L\) layers: \[\frac{\partial \mathcal{L}}{\partial x_1} = \prod_{\ell=1}^{L} \frac{\partial f_\ell}{\partial x_{\ell-1}}\]
If \(\left|\frac{\partial f_\ell}{\partial x_{\ell-1}}\right| \leq \gamma < 1\) for every layer: \[\left|\frac{\partial \mathcal{L}}{\partial x_1}\right| \lesssim \gamma^L \rightarrow 0\]
Skip connections: \(x_\ell = \mathcal{F}_\ell(x_{\ell-1}) + x_{\ell-1}\)
Even so, depth has limits: the original (post-activation) ResNet-1202 performs worse than ResNet-110 on CIFAR-10.

Forward: \[x_\ell = \mathcal{F}_\ell(x_{\ell-1}) + x_{\ell-1}\]
Backward: \[\frac{\partial \mathcal{L}}{\partial x_{\ell-1}} = \frac{\partial \mathcal{L}}{\partial x_\ell} \left(1 + \frac{\partial \mathcal{F}_\ell}{\partial x_{\ell-1}}\right)\]
During training:

Addition can wash out earlier features
Instead of addition: \[x_\ell = H_\ell([x_0, x_1, ..., x_{\ell-1}])\]
where \([\cdot]\) denotes concatenation.
Direct connections: Layer \(\ell\) receives feature maps from all preceding layers
Gradient flow: \[\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\partial \mathcal{L}}{\partial x_L} \frac{\partial x_L}{\partial x_i} + \sum_{j=i+1}^{L-1} \frac{\partial \mathcal{L}}{\partial x_j} \frac{\partial x_j}{\partial x_i}\]
\(L(L+1)/2\) connections in an \(L\)-layer network


Standard DenseNet-B (Bottleneck):
Pre-activation design rationale:

Layer \(\ell\) in a dense block:
For an \(L\)-layer block: \[\text{Total channels} = k_0 + k \times L\]
Feature maps stored for concatenation: \[\text{Memory} \propto \sum_{\ell=1}^{L} (k_0 + k\ell) = k_0 L + \frac{kL(L+1)}{2}\]

Purpose: Control model complexity
If block outputs \(m\) feature maps:
Compression is critical for deep networks


DenseNet achieves comparable accuracy with significantly fewer parameters
Must maintain all intermediate features:
Shared memory allocations:
# Dense block forward: each layer sees all preceding feature maps
features = [x0]
for layer in dense_block:
    new_features = layer(torch.cat(features, dim=1))
    features.append(new_features)
return torch.cat(features, dim=1)
Memory-efficient variant: Recompute activations during backward pass (trading compute for memory)

Loss gradient to layer \(\ell\): \[\frac{\partial \mathcal{L}}{\partial x_\ell} = \frac{\partial \mathcal{L}}{\partial x_L} \frac{\partial x_L}{\partial x_\ell} + \sum_{s=\ell+1}^{L-1} \frac{\partial \mathcal{L}}{\partial H_s} \frac{\partial H_s}{\partial x_\ell}\]
Each layer receives:
Shorter paths to loss function → stronger gradient signal
Average path length: \(\frac{L+1}{2}\) (vs \(L\) in sequential)


Later layers actively use features from all depths, not just immediate predecessors
ResNet-101 (44.5M params):
DenseNet-201 (20M params):
Dense connections act as implicit regularization:

DenseNet-201: 4.3B FLOPs
ResNet-101: 7.8B FLOPs
Despite more connections, fewer FLOPs due to:
Concatenation overhead:
GPU utilization:


Dense connectivity principle spawned multiple architectural innovations
Classification: \(f: \mathbb{R}^{H \times W \times 3} \rightarrow \{1, ..., C\}\)
Classification + Localization: \[f: \mathbb{R}^{H \times W \times 3} \rightarrow \{1, ..., C\} \times \mathbb{R}^4\]
Object Detection: \[f: \mathbb{R}^{H \times W \times 3} \rightarrow \mathcal{P}(\{1, ..., C\} \times \mathbb{R}^4)\]
Segmentation: \[f: \mathbb{R}^{H \times W \times 3} \rightarrow \{1, ..., C\}^{H' \times W'}\]
where \(\mathcal{P}\) denotes power set (variable number of outputs)

| Task | Output Space | Dimension |
|---|---|---|
| Classification | \(\{1, ..., C\}\) | \(C\) |
| Localization | \(\{1, ..., C\} \times \mathbb{R}^4\) | \(C + 4\) |
| Detection | \(\bigcup_{n=0}^{N_{max}} (\{1, ..., C\} \times \mathbb{R}^4)^n\) | Variable |
| Segmentation | \(\{1, ..., C\}^{H \times W}\) | \(C \times H \times W\) |
| Instance Seg. | \(\{0, ..., N\}^{H \times W}\) | \((N+1) \times H \times W\) |
Detection is uniquely challenging: unknown cardinality

Direct coordinates: \((x_{min}, y_{min}, x_{max}, y_{max})\)
Center + size: \((x_c, y_c, w, h)\)
\[\mathcal{L} = \mathcal{L}_{cls}(c, \hat{c}) + \lambda \cdot \mathbb{1}_{c > 0} \cdot \mathcal{L}_{loc}(b, \hat{b})\]
where \(\mathbb{1}_{c > 0}\) indicates object presence
\[\mathcal{L}_{L2} = ||b - \hat{b}||^2\]
Scale dependent: large boxes dominate loss

Sliding Window:
Region Proposals:


Key Challenges:



Anchor Box Strategy:
\[\text{IoU}(B_1, B_2) = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}\]
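A minimal IoU implementation for corner-format boxes (the example boxes are illustrative):

```python
def iou(b1, b2):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection area 1, union 7
```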

Anchor parameters:
Total anchors = \(|S| \times |R| \times \frac{H}{k} \times \frac{W}{k}\)
Given anchor \((x_a, y_a, w_a, h_a)\):
\[\begin{align} t_x &= (x - x_a) / w_a \\ t_y &= (y - y_a) / h_a \\ t_w &= \log(w / w_a) \\ t_h &= \log(h / h_a) \end{align}\]
Network predicts \((t_x, t_y, t_w, t_h)\)
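A sketch of the encode/decode pair implied by these equations; the example anchor and ground-truth box are illustrative values:

```python
import math

def encode(box, anchor):
    """Both in (x_c, y_c, w, h) format; returns (t_x, t_y, t_w, t_h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert the encoding to recover an absolute box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

# Illustrative anchor and ground-truth box (not from the slides)
anchor = (50.0, 50.0, 32.0, 64.0)
box = (58.0, 44.0, 40.0, 48.0)
t = encode(box, anchor)
recovered = decode(t, anchor)
```

The log parameterization keeps predicted widths and heights positive after decoding.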

\[\mathcal{L} = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(t_i, t_i^*)\]
where:
\[\mathcal{L}_{focal} = -\alpha (1-p_t)^\gamma \log(p_t)\]

Semantic: Each pixel → class label \[y[i,j] \in \{0, 1, ..., C\}\]
Instance: Each pixel → object ID \[y[i,j] \in \{0, 1, ..., N\}\]
Requires 32× upsampling.

\[\mathcal{L}_{CE} = -\frac{1}{HW} \sum_{i,j} \sum_{c} y_{ijc} \log \hat{y}_{ijc}\]
\[\mathcal{L}_{Dice} = 1 - \frac{2|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}\]
Approximated as: \[\mathcal{L}_{Dice} = 1 - \frac{2\sum_{ij} y_{ij}\hat{y}_{ij} + \epsilon}{\sum_{ij} y_{ij} + \sum_{ij} \hat{y}_{ij} + \epsilon}\]
Better for imbalanced classes
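A minimal NumPy version of the approximated Dice loss (the tiny masks are illustrative):

```python
import numpy as np

def dice_loss(y, y_hat, eps=1e-6):
    """Soft Dice: 1 - (2*sum(y*y_hat) + eps) / (sum(y) + sum(y_hat) + eps)."""
    inter = np.sum(y * y_hat)
    return 1 - (2 * inter + eps) / (np.sum(y) + np.sum(y_hat) + eps)

y = np.array([[1.0, 1.0], [0.0, 0.0]])
loss_perfect = dice_loss(y, y)               # ~0 for a perfect prediction
loss_empty = dice_loss(y, np.zeros_like(y))  # ~1 for an all-background prediction
```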

For each class:
\[\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c\]
Often reported at specific IoU (mAP@0.5, mAP@[.5:.95])
\[\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}\]
where intersection/union computed per class

Given image \(I \in \mathbb{R}^{H \times W \times 3}\)
Predict keypoints \(\{p_k\}_{k=1}^K\) where \(p_k = (x_k, y_k)\)

Network directly predicts coordinates: \[\hat{p}_k = f_\theta(I) \in \mathbb{R}^2\]
Loss: \(\mathcal{L} = \sum_k ||\hat{p}_k - p_k^*||^2\)
Empirical result: ~20% lower accuracy than heatmap methods

Output: \(H \in \mathbb{R}^{H' \times W' \times K}\)
Each channel \(H_k\) represents probability that keypoint \(k\) exists at each location.
Ground truth heatmap for keypoint \(k\) at \(p_k^*\):
\[G_k(p) = \exp\left(-\frac{||p - p_k^*||^2}{2\sigma^2}\right)\]
where \(\sigma\) controls spatial precision
Typical: \(\sigma = 1\) pixel at output resolution
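A NumPy sketch of ground-truth heatmap generation; the heatmap size and keypoint location are illustrative:

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=1.0):
    """G(p) = exp(-||p - p*||^2 / (2 sigma^2)) on an H' x W' grid."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# Illustrative output resolution and keypoint location
G = gaussian_heatmap((64, 48), center=(20, 10), sigma=1.0)
```

The map peaks at 1 exactly at the keypoint and decays smoothly around it, giving the MSE target a soft spatial tolerance.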

\[\mathcal{L} = \frac{1}{KHW} \sum_{k=1}^K \sum_{i,j} (G_k[i,j] - \hat{H}_k[i,j])^2\]
Extract keypoint locations: \[\hat{p}_k = \arg\max_{(i,j)} \hat{H}_k[i,j]\]
Sub-pixel refinement:

Repeated encoding-decoding:
Loss at each hourglass output: \[\mathcal{L} = \sum_{t=1}^T \lambda_t \mathcal{L}_t\]
where \(T\) = number of hourglasses
Typical: \(T=2\) to \(T=8\), \(\lambda_t = 1\)

Complexity: \(O(N \cdot T_{pose})\)
Complexity: \(O(T_{detect} + N^2)\) for association
\(N\) = number of people

Between keypoints \(j_1\) and \(j_2\), define:
\[\mathbf{L}(p) = \begin{cases} \mathbf{v} & \text{if } p \text{ on limb} \\ \mathbf{0} & \text{otherwise} \end{cases}\]
where \(\mathbf{v} = \frac{j_2 - j_1}{||j_2 - j_1||}\) (unit vector)
Score for connecting candidates \(d_{j_1}\) and \(d_{j_2}\):
\[E = \int_0^1 \mathbf{L}(p(u)) \cdot \frac{d_{j_2} - d_{j_1}}{||d_{j_2} - d_{j_1}||} du\]
Approximated by sampling along line

Given detected parts, find optimal grouping:
\[\max \sum_{c \in \mathcal{C}} \sum_{(j_1, j_2) \in c} E_{j_1,j_2}\]
subject to: no part used twice
Greedy: Sort by PAF score, assign sequentially
Hungarian Algorithm: Optimal bipartite matching
In practice: Greedy works well (OpenPose)

2D → 3D Lifting:
Direct 3D Prediction:
Multiple 3D poses project to same 2D: \[\pi(P_{3D}) = p_{2D}\]
Need priors or multiple views

Stage 1: Where might objects be?
Stage 2: What is in each region?
Reduces search space from ~10⁶ sliding windows to ~10³ proposals


Approach: Apply CNN to region proposals
Pipeline:
# Pseudocode
for image in dataset:
    regions = selective_search(image)  # ~2000 regions
    for region in regions:  # Sequential!
        warped = resize_to_227x227(region)
        features = alexnet.forward(warped)  # Full forward pass
        class_scores = svm_classify(features)
        bbox_refined = bbox_regressor(features)
Limitations:
Training Complexity:

Each region processed independently → no shared computation
Stage 1: Pre-training
Stage 2: Fine-tuning
Stage 3: SVM Training
Stage 4: Bounding Box Regression

This complexity stems from historical constraints: end-to-end fine-tuning was not well understood
| Component | Time | Percentage |
|---|---|---|
| Selective Search | 2s | 4% |
| Feature Extraction | 45s | 96% |
| - 2000 CNN passes | ||
| - No sharing | ||
| SVM + BBox | 0.3s | <1% |
Takeaway: Repeated feature computation dominates


Improvement: Share convolutional computation
# Pseudocode
for image in dataset:
    # Single forward pass over the whole image
    feature_map = conv_layers.forward(image)
    regions = selective_search(image)
    for region in regions:
        # Project region to feature-map coordinates
        roi_features = roi_pool(feature_map, region)
        class_scores, bbox_deltas = fc_layers(roi_features)
RoI Pooling: Spatial pyramid pooling variant
Joint objective function: \[\mathcal{L} = \mathcal{L}_{cls}(p, u) + \lambda[u \geq 1]\mathcal{L}_{bbox}(t^u, v)\]
Performance: 0.32s/image (146× speedup)
mAP improvement: 66% → 70% on PASCAL VOC

Input: Feature map \(F \in \mathbb{R}^{H \times W \times C}\) Region proposal: \((x, y, w, h)\) in image coords
Output: Fixed size \(\mathbb{R}^{H' \times W' \times C}\) (typically \(7 \times 7\))
\[\tilde{x} = \lfloor x / \text{stride} \rfloor\]
Misalignment accumulates significantly.

\[L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)\]
where:
\[L_{loc}(t, v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L1}(t_i - v_i)\]
\[\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}\]
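A direct transcription of the piecewise definition:

```python
def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

# Quadratic near zero, linear in the tails, continuous at |x| = 1
values = [smooth_l1(v) for v in (-3.0, 0.5, 1.0, 2.0)]
```

The linear tails bound the gradient at ±1, so outlier boxes cannot dominate the regression loss the way they do under L2.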


Objective: Learn region proposals with a CNN

Innovation: Region Proposal Network (RPN)
RPN Components:
Two-Stage Architecture:
Stage 1: RPN
Stage 2: Detection
4-Way Loss Function:
Speed: 0.2s/image (10 FPS) | Accuracy: State-of-the-art at release

Positive (object):
Negative (background):
Ignored:
\[L = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)\]

Approach 1: Image pyramids
Approach 2: Feature pyramids
Approach 3: Anchor pyramids

Single object → multiple overlapping detections
def nms(boxes, scores, threshold):
    # Sort by confidence (descending)
    indices = np.argsort(scores)[::-1]
    keep = []
    while len(indices) > 0:
        # Keep highest-scoring remaining box
        i = indices[0]
        keep.append(i)
        # Compute IoU of this box with the rest
        # (compute_iou assumed to return an array of IoUs)
        ious = compute_iou(boxes[i], boxes[indices[1:]])
        # Remove boxes that overlap it too much
        indices = indices[1:][ious < threshold]
    return keep
Threshold selection:

Computational cost: O(N²) for N detections per class
Modern variants: Soft-NMS, Adaptive-NMS, Learned-NMS

Enhancement to Faster R-CNN (Lin et al., CVPR 2017)

Problem: Faster R-CNN uses single-scale features (C₄ or C₅) → poor for small objects
FPN: Multi-scale feature maps with strong semantics at all scales via top-down pathway
Gain: +2.3 mAP on COCO with minimal speed impact (5-10% slower)
Bottom-up pathway:
Top-down pathway:
Mathematical formulation: \[P_\ell = \text{Conv}_{1 \times 1}(C_\ell) + \text{Upsample}(P_{\ell+1})\]
Feature dimension: All P layers → 256 channels


Progressive refinement with increasing IoU thresholds:
Each stage specialized for its threshold
Learnable sampling offsets: \[y(p) = \sum_{k=1}^K w_k \cdot x(p + p_k + \Delta p_k)\]
Adaptive receptive fields for objects

Region-based methods continue to achieve state-of-the-art accuracy on detection benchmarks
Faster R-CNN (2015): ~5 FPS on GPU
Real-time requirement: 30+ FPS (33ms)
For 300 proposals:
Eliminating region-wise computation through unified detection.

Divide image into \(S \times S\) grid (typically \(7 \times 7\))
Each grid cell:
\[\text{Output}: S \times S \times (B \times 5 + C)\]
where each box has:
Single forward pass → all detections

For YOLO v1 with \(B=2\) boxes, \(C=20\) classes:
Bounding boxes (per box):
Class probabilities (shared per cell):
\[P(c_i | \text{box}_j) = P(c_i | \text{Object}) \times P(\text{Object}) \times \text{IoU}\]
Threshold and apply NMS for final detections

\[\mathcal{L} = \mathcal{L}_{\text{coord}} + \mathcal{L}_{\text{conf}} + \mathcal{L}_{\text{class}}\]
Localization loss (only if object present): \[\lambda_{\text{coord}} \sum_{i}^{S^2} \sum_{j}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right]\]
\[+ \lambda_{\text{coord}} \sum_{i}^{S^2} \sum_{j}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right]\]
Confidence loss: \[\sum_{i}^{S^2} \sum_{j}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i}^{S^2} \sum_{j}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2\]
Classification loss: \[\sum_{i}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2\]
Square root for width/height:
Loss weights:
Problems with squared error:

24 convolutional layers + 2 fully connected
Inspired by GoogLeNet (but simpler):
Pretrain on ImageNet (224×224)
Detection fine-tuning (448×448)

Single 7×7 feature map:
Predictions from multiple feature maps:
Each location: multiple default boxes with different scales/ratios

For \(m\) prediction layers, scale at layer \(k\): \[s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1)\]
where \(s_{\min} = 0.2\), \(s_{\max} = 0.9\)
For ratios \(a_r \in \{1, 2, 3, 1/2, 1/3\}\):
Additional box: \(s'_k = \sqrt{s_k s_{k+1}}\) for \(a_r = 1\)
38×38×4 + 19×19×6 + 10×10×6 + 5×5×6 + 3×3×4 + 1×1×4 = 8732 default boxes


For each ground truth box:
Problem: ~8700 default boxes, <10 positive
Solution:
Only computed for positive matches: \[t_x = (g_x - d_x)/d_w, \quad t_y = (g_y - d_y)/d_h\] \[t_w = \log(g_w/d_w), \quad t_h = \log(g_h/d_h)\]
where \(g\) = ground truth, \(d\) = default box

\[L(x, c, l, g) = \frac{1}{N}(L_{\text{conf}}(x, c) + \alpha L_{\text{loc}}(x, l, g))\]
where \(N\) = number of matched default boxes
Smooth L1 between predicted (\(l\)) and target (\(g\)): \[L_{\text{loc}} = \sum_{i \in \text{Pos}} \sum_{m \in \{x,y,w,h\}} \text{smooth}_{L1}(l_i^m - g_i^m)\]
Softmax loss over classes: \[L_{\text{conf}} = -\sum_{i \in \text{Pos}} x_{ij}^p \log(\hat{c}_i^p) - \sum_{i \in \text{Neg}} \log(\hat{c}_i^0)\]
where \(\hat{c}_i^0\) is background class probability
\(\alpha = 1\) in practice (equal weighting)

Dense detectors evaluate ~100k locations:
Cross-entropy for easy negatives:
Example: \(p = 0.9\) for easy negative

Standard Cross-Entropy: \[\text{CE}(p_t) = -\log(p_t)\]
Focal Loss: \[\text{FL}(p_t) = -\alpha_t(1 - p_t)^\gamma \log(p_t)\]
where:
\[\frac{\partial \text{FL}}{\partial p_t} = \alpha(1-p_t)^{\gamma-1}\left(\gamma \log(p_t) - \frac{1-p_t}{p_t}\right)\]
As \(p_t \rightarrow 1\): gradient \(\rightarrow 0\) rapidly
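A numeric sketch of the down-weighting: relative to \(\alpha\)-scaled cross-entropy, each example's loss is multiplied by the factor \((1-p_t)^\gamma\):

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)

def scaled_ce(p_t, alpha=0.25):
    return -alpha * math.log(p_t)

# Easy negative (p_t = 0.9): contribution shrinks by (1 - 0.9)^2 = 0.01 (100x)
easy_ratio = focal_loss(0.9) / scaled_ce(0.9)
# Hard example (p_t = 0.1): barely down-weighted, (1 - 0.1)^2 = 0.81
hard_ratio = focal_loss(0.1) / scaled_ce(0.1)
```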


Key insight: FLOPs ≠ Speed. Memory access patterns and parallelization matter.
Architectural Changes:
Loss Design:
Performance (COCO):
Model scales from 3M to 68M parameters
Anchors have hyperparameters:
Predict object centers as keypoints:
For each location \((x, y)\):

~100k predictions compensate for single pass:
Modern backbones extract semantic features:
Better handling of imbalance:
Hardware optimization for single forward pass

Single-shot detectors prioritize efficiency over architectural complexity, achieving real-time performance through dense predictions and sophisticated loss functions.
Input: \(X \in \mathbb{R}^{H \times W \times 3}\)
Semantic Segmentation: \[f: \mathbb{R}^{H \times W \times 3} \rightarrow \{0, 1, ..., C\}^{H' \times W'}\]
Instance Segmentation: \[f: \mathbb{R}^{H \times W \times 3} \rightarrow (\{0, 1\}^{H' \times W'})^N\]
where \(N\) is the number of instances.
For 512×512 image: 262,144 predictions


Semantic: All pixels labeled by class

Instance: Each object gets unique ID
Standard CNN (ResNet-50):
32× spatial reduction
High resolution → High memory/computation Low resolution → Lost spatial detail
Need: Semantic understanding + spatial precision

FC layer with 4096 neurons on 7×7×512 input:
FC layer: \[y_i = \sum_{j} W_{ij} x_j + b_i\]
As a convolution whose kernel spans the entire 7×7 input: \[y[0,0,i] = \sum_{h,w,c} W[h,w,c,i] \cdot x[h,w,c] + b[i]\]
Once fully convolutional:

Non-learnable, deterministic: \[y[i,j] = \sum_{m,n} x[\lfloor i/s \rfloor + m, \lfloor j/s \rfloor + n] \cdot W_{mn}\]
where \(W_{mn}\) are bilinear weights based on fractional positions.
Learnable upsampling: \[y[si+m, sj+n] = \sum_{c} h[m,n,c] \cdot x[i,j,c]\]
Equivalent to:
Transposed conv whose kernel size is not divisible by the stride overlaps unevenly, producing checkerboard artifacts

FCN-32s: Direct 32× upsampling
FCN-16s: Combine pool4 + pool5
FCN-8s: Combine pool3 + pool4 + pool5
| Model | Mean IoU | Inference (ms) |
|---|---|---|
| FCN-32s | 59.4 | 80 |
| FCN-16s | 62.4 | 85 |
| FCN-8s | 62.7 | 90 |
Skip connections recover fine details with minimal computational cost

Contracting Path (left):
Expanding Path (right):
Addition (FCN): Features compete Concatenation (U-Net): Network chooses
Total parameters: ~31M (for 64 initial filters)


Concatenation doubles channels at each decoder level, increasing expressiveness
Standard convolution: \[y[i,j] = \sum_{m,n} h[m,n] \cdot x[i+m, j+n]\]
Atrous convolution with rate \(r\): \[y[i,j] = \sum_{m,n} h[m,n] \cdot x[i+rm, j+rn]\]
Parameters remain constant: \(k^2\)
Replace stride/pooling with dilation to maintain resolution

Parallel branches with different rates:
Concatenate all branches → 1×1 conv to fuse
\[F_{out} = \text{Conv}_{1\times 1}\Big( \big\Vert_{r \in \{1,6,12,18\}} \text{ASPP}_r(F_{in}) \;\big\Vert\; \text{GAP}(F_{in})\Big)\]
where \(\Vert\) denotes concatenation.

\[E(x) = \sum_i \psi_u(x_i) + \sum_{i,j} \psi_p(x_i, x_j)\]
Unary potential (from CNN): \[\psi_u(x_i) = -\log P(x_i)\]
Pairwise potential (smoothness): \[\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m} w^{(m)} k^{(m)}(f_i, f_j)\]
where kernels based on:
Iterative update (5-10 iterations): \[Q_i(l) = \frac{1}{Z_i} \exp\{-\psi_u(l) - \sum_j \psi_p(l, Q_j)\}\]

CRF refines boundaries using image appearance
Add mask branch parallel to box/class heads:
Box head: 7×7×256 → FC → class + box Mask head: 14×14×256 → Conv → 28×28×C
RoIPool: Quantization to integer coordinates RoIAlign: Bilinear interpolation at exact locations
\[\text{RoIAlign}(x, y) = \sum_{i,j} F_{ij} \cdot \max(0, 1 - |x - i|) \cdot \max(0, 1 - |y - j|)\]
where \(F_{ij}\) is the feature-map value at integer grid point \((i,j)\)
Binary cross-entropy per class: \[\mathcal{L}_{mask} = -\frac{1}{m^2} \sum_{i,j} [y_{ij}^k \log \hat{y}_{ij}^k + (1-y_{ij}^k)\log(1-\hat{y}_{ij}^k)]\]
Only computed for ground truth class \(k\)

RoI Pool:
RoIAlign:


Multi-task learning: \(\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{box} + \mathcal{L}_{mask}\)
\[\mathcal{L}_{CE} = -\frac{1}{HW}\sum_{i,j} \sum_{c} y_{ijc} \log(\hat{y}_{ijc})\]
Problem: Class imbalance (background >> foreground)
\[\mathcal{L}_{Dice} = 1 - \frac{2\sum_{i,j} y_{ij}\hat{y}_{ij} + \epsilon}{\sum_{i,j} y_{ij} + \sum_{i,j} \hat{y}_{ij} + \epsilon}\]
Directly optimizes IoU-like metric
\[\mathcal{L}_{Focal} = -\alpha(1-\hat{y}_{ij})^\gamma y_{ij}\log(\hat{y}_{ij})\]
Down-weights easy pixels
Weight pixels by distance to boundary: \[w_{ij} = 1 + \alpha \cdot \exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)\]

\[\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}\]
where for each class \(c\):
\[\text{FWIoU} = \frac{1}{\sum_c n_c} \sum_{c} n_c \cdot \text{IoU}_c\]
Average Precision at different IoU thresholds:

Stuff: Uncountable regions (sky, road, grass) Things: Countable objects (cars, people)
\[PQ = \frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}\]
Decomposes into: \[PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_{\text{Segmentation Quality}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Recognition Quality}}\]
Single network for both tasks:

Two pathways:
Spatial Path: High resolution, shallow
Context Path: Low resolution, deep
Fusion with Feature Fusion Module (FFM)
| Model | mIoU | FPS | Params |
|---|---|---|---|
| FCN-8s | 65.3 | 5 | 134M |
| U-Net | 67.8 | 15 | 31M |
| BiSeNet | 68.4 | 65 | 13M |
| ENet | 58.3 | 135 | 0.4M |

Maintains high-resolution representations:
\[F_s^{(l+1)} = \sum_{r=1}^{S} \mathcal{T}_{r \rightarrow s}(F_r^{(l)})\]
where \(\mathcal{T}\) is resolution transformation.
Treat segmentation as rendering:
Foundation model for segmentation:
Note: Transformer-based, covered in later lectures

Segmentation has evolved from simple upsampling to sophisticated multi-scale reasoning and iterative refinement
Voxels: Regular 3D grid
Point Clouds: Unordered sets
Meshes: Vertices + faces

Standard 3D convolution: \[y[d,i,j] = \sum_{c} \sum_{k,m,n} w[k,m,n,c] \cdot x[d+k, i+m, j+n, c]\]
Computational complexity: \[O(K^3 \cdot C_{in} \cdot C_{out} \cdot D \cdot H \cdot W)\]
# PyTorch 3D convolution
conv3d = nn.Conv3d(
    in_channels=32,
    out_channels=64,
    kernel_size=3,
    padding=1
)
# Input: (B, 32, D, H, W)
# Output: (B, 64, D, H, W)
Only ~5% voxels occupied → sparse convolutions

Sparsity: computational necessity at high resolutions
Point cloud: \(\mathcal{P} = \{p_i\}_{i=1}^N\) where \(p_i \in \mathbb{R}^3\)
Required: \(f(\{p_1, ..., p_N\}) = f(\pi(\{p_1, ..., p_N\}))\)
for any permutation \(\pi\).
\[f(\{x_1, ..., x_n\}) = \gamma \left( \max_{i=1,...,n} h(x_i) \right)\]
where:
Achieves permutation invariance through symmetric function

T-Net: Learn input transformation
Feature Extraction:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointNetEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(3, 64, 1)
        self.conv2 = nn.Conv1d(64, 128, 1)
        self.conv3 = nn.Conv1d(128, 1024, 1)

    def forward(self, x):
        # x: (B, 3, N)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.conv3(x)
        # Global max pooling over the N points (symmetric -> permutation invariant)
        x = torch.max(x, 2)[0]
        return x  # (B, 1024)
Only subset of points contribute to each feature dimension
Build k-NN graph dynamically: \[\mathbf{e}_{ij} = h_\Theta(\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i)\]
Edge convolution: \[\mathbf{x}'_i = \max_{j \in \mathcal{N}(i)} \mathbf{e}_{ij}\]
Place kernel points in 3D space: \[y_i = \sum_{j \in \mathcal{N}(i)} \sum_{k=1}^K h(\mathbf{x}_j - \mathbf{x}_i, \tilde{\mathbf{x}}_k) W_k x_j\]

KITTI benchmark: Cars @ 0.7 IoU

Graph structure: \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\)
Spectral convolution: \[y = U g_\theta U^T x\]
where \(U\) are eigenvectors of graph Laplacian.
Spatial convolution (MoNet): \[y_i = \sum_{j \in \mathcal{N}(i)} w(\mathbf{u}_{ij}) x_j\]
MVCNN: 90.1% accuracy on ModelNet40 (vs 77% for voxels, 89.2% for PointNet)

Output: \(D \in \mathbb{R}^{H \times W}\)
Scale-invariant loss: \[\mathcal{L}_{si} = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2\]
where \(d_i = \log \hat{D}_i - \log D_i\)
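A NumPy sketch of this loss; note that with \(\lambda = 1\) it is fully invariant to a global depth scale (the example depths are illustrative):

```python
import numpy as np

def scale_invariant_loss(d_pred, d_true, lam=0.5):
    """L_si = (1/n) sum d_i^2 - (lam/n^2) (sum d_i)^2, d_i = log(pred) - log(true)."""
    d = np.log(d_pred) - np.log(d_true)
    n = d.size
    return np.sum(d ** 2) / n - lam * np.sum(d) ** 2 / n ** 2

depth = np.array([1.0, 2.0, 4.0])   # illustrative depth values (flattened map)
l_scaled = scale_invariant_loss(2.0 * depth, depth, lam=1.0)  # prediction off by 2x
l_exact = scale_invariant_loss(depth, depth, lam=1.0)
```

With \(\lambda = 1\) a prediction that is wrong only by a global factor incurs (nearly) zero loss, matching the monocular scale ambiguity.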
Ordinal loss (relative depth): \[\mathcal{L}_{ord} = \sum_{(i,j) \in \mathcal{P}} -\log \sigma(z_i - z_j)\]
where \(\mathcal{P}\) are pairs with known ordering
Structure from motion constraint: \[I_{t+1} \approx \text{warp}(I_t, D_t, T_{t \rightarrow t+1})\]
Photometric loss: \[\mathcal{L}_{photo} = ||I_{t+1} - \hat{I}_{t+1}||_1 + \big(1 - \text{SSIM}(I_{t+1}, \hat{I}_{t+1})\big)\] (SSIM measures similarity, so the loss penalizes \(1 - \text{SSIM}\))

Self-supervision enables training on unlimited video data without depth labels