
Value and Q-Iteration on Stochastic Gridworld
Implement Value Iteration and Q-Iteration to solve a stochastic gridworld navigation problem. A rescue robot must reach a goal location while navigating hazardous terrain with uncertain movement dynamics.
Environment Dynamics
The robot operates in a 5×5 grid with stochastic transitions:
- 0.8 probability: Move in intended direction
- 0.1 probability: Drift perpendicular left
- 0.1 probability: Drift perpendicular right
Collisions with boundaries or obstacles result in no movement. The environment provides sparse rewards: most transitions incur a small negative step cost, while specific cells yield a large positive or negative reward upon entry.
Environment:
- Goal (G): +10 reward, terminates episode
- Penalty (P): -5 reward
- Obstacles (X): Impassable cells
- Regular movement: -0.1 cost per step
Parameters:
- Episode limit: 50 steps
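For concreteness, a minimal environment sketch is shown below. The grid layout is hypothetical and chosen only to illustrate the data structures; the actual cell positions come from the provided starter code, and names such as `GRID`, `CELL_REWARD`, `ACTIONS`, `DRIFT`, and `transitions` are introduced here rather than taken from it.

```python
import numpy as np

# Hypothetical 5x5 layout for illustration only; the real cell positions
# come from the provided starter code.
# '.' = free cell, 'X' = obstacle, 'G' = goal (+10, terminal), 'P' = penalty (-5).
GRID = np.array([
    ['.', '.', '.', '.', 'G'],
    ['.', 'X', '.', 'P', '.'],
    ['.', 'X', '.', '.', '.'],
    ['.', '.', '.', 'X', '.'],
    ['.', '.', '.', '.', '.'],
])
N_ROWS, N_COLS = GRID.shape

STEP_COST = -0.1                        # regular movement cost
CELL_REWARD = {'G': 10.0, 'P': -5.0}    # reward for entering these cells
GAMMA = 0.95                            # discount factor (see Tasks)

# Actions as (d_row, d_col); row 0 is the top row of the grid.
ACTIONS = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}
# Perpendicular drift for each intended action: (drift-left, drift-right).
DRIFT = {'UP': ('LEFT', 'RIGHT'), 'DOWN': ('RIGHT', 'LEFT'),
         'LEFT': ('DOWN', 'UP'), 'RIGHT': ('UP', 'DOWN')}

def is_enterable(r, c):
    """A cell can be entered if it lies inside the grid and is not an obstacle."""
    return 0 <= r < N_ROWS and 0 <= c < N_COLS and GRID[r, c] != 'X'

def transitions(state, action):
    """All (probability, next_state, reward) outcomes of one state-action pair
    under the 0.8 / 0.1 / 0.1 drift model; blocked moves leave the state unchanged."""
    r, c = state
    drift_left, drift_right = DRIFT[action]
    outcomes = []
    for prob, direction in [(0.8, action), (0.1, drift_left), (0.1, drift_right)]:
        dr, dc = ACTIONS[direction]
        nr, nc = (r + dr, c + dc) if is_enterable(r + dr, c + dc) else (r, c)
        reward = CELL_REWARD.get(GRID[nr, nc], STEP_COST)
        outcomes.append((prob, (nr, nc), reward))
    return outcomes
```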
Dynamic Programming Framework
Value Iteration operates in the state space (a table of \(|\mathcal{S}|\) values), while Q-Iteration operates in the state-action space (a table of \(|\mathcal{S}| \times |\mathcal{A}|\) values).
Value Iteration
The optimal state-value function \(V^*(s)\) is the maximum expected return from state \(s\):
\[V^*(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]\]
The update rule:
\[V_{k+1}(s) \leftarrow \max_{a} \mathbb{E}_{s' \sim P(\cdot|s,a)} \left[ R(s,a,s') + \gamma V_k(s') \right]\]
Convergence to the unique fixed point is guaranteed for \(\gamma < 1\), since the Bellman optimality operator is a \(\gamma\)-contraction in the \(\|\cdot\|_\infty\) norm.
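A direct, minimal translation of this update might look as follows, reusing `GRID`, `ACTIONS`, `GAMMA`, and `transitions` from the environment sketch above; the cap of 1000 sweeps is an arbitrary safety limit, and the default tolerance matches the \(\epsilon = 10^{-4}\) given under Implementation Details. The `policy_from_V` helper anticipates the policy-extraction task: with only \(V^*\) available, one extra one-step lookahead per state recovers the greedy action.

```python
def value_iteration(tol=1e-4):
    """Synchronous Value Iteration on the gridworld sketched above."""
    V = np.zeros(GRID.shape)
    for k in range(1, 1000):
        V_new = np.zeros_like(V)
        for r in range(N_ROWS):
            for c in range(N_COLS):
                if GRID[r, c] in ('X', 'G'):   # obstacles and the terminal goal keep value 0
                    continue
                V_new[r, c] = max(
                    sum(p * (rew + GAMMA * V[ns])
                        for p, ns, rew in transitions((r, c), a))
                    for a in ACTIONS
                )
        if np.max(np.abs(V_new - V)) < tol:    # sup-norm stopping rule
            return V_new, k
        V = V_new
    return V, k

def policy_from_V(V):
    """Greedy policy w.r.t. V: one extra one-step lookahead per free cell."""
    policy = np.full(GRID.shape, '.', dtype=object)
    for r in range(N_ROWS):
        for c in range(N_COLS):
            if GRID[r, c] in ('X', 'G'):
                continue
            policy[r, c] = max(
                ACTIONS,
                key=lambda a: sum(p * (rew + GAMMA * V[ns])
                                  for p, ns, rew in transitions((r, c), a)),
            )
    return policy
```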
Q-Iteration
The optimal action-value function \(Q^*(s,a)\) is the expected return from taking action \(a\) in state \(s\) and then following the optimal policy:
\[Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]\]
Policy extraction: \(\pi^*(s) = \arg\max_a Q^*(s,a)\)
The update rule:
\[Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]\]
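A matching Q-Iteration sketch, again building on the environment sketch above. `ACTION_LIST` fixes an ordering of the four actions for the Q-table, and `greedy_policy` implements the \(\arg\max\) extraction directly from the table, with no extra lookahead needed.

```python
ACTION_LIST = list(ACTIONS)       # fixed action ordering for the Q-table

def q_iteration(tol=1e-4):
    """Synchronous Q-Iteration; the table holds one value per (row, col, action)."""
    Q = np.zeros((N_ROWS, N_COLS, len(ACTION_LIST)))
    for k in range(1, 1000):
        Q_new = np.zeros_like(Q)
        for r in range(N_ROWS):
            for c in range(N_COLS):
                if GRID[r, c] in ('X', 'G'):    # no update for obstacles or the terminal goal
                    continue
                for i, a in enumerate(ACTION_LIST):
                    Q_new[r, c, i] = sum(
                        p * (rew + GAMMA * np.max(Q[ns]))   # max over a' at the successor
                        for p, ns, rew in transitions((r, c), a)
                    )
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, k
        Q = Q_new
    return Q, k

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a), returned as an array of action names."""
    return np.array(ACTION_LIST, dtype=object)[np.argmax(Q, axis=-1)]
```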
Implementation Details
For each state-action pair, consider all possible next states weighted by transition probabilities. With action \(a = \text{UP}\):
- \(P(s_{\text{north}}|s, \text{UP}) = 0.8\) (intended)
- \(P(s_{\text{west}}|s, \text{UP}) = 0.1\) (drift left)
- \(P(s_{\text{east}}|s, \text{UP}) = 0.1\) (drift right)
If a transition would hit a wall or obstacle, the agent stays in the current state.
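As a quick sanity check, the `transitions` helper from the environment sketch can be queried for a single pair; the cell \((2, 2)\) is arbitrary and only meaningful for the hypothetical layout shown earlier.

```python
# Three weighted outcomes for UP from one (hypothetical) free cell.
# Each entry is (probability, next_state, reward); a blocked drift collapses
# back onto the current cell, exactly as described above.
for p, next_state, reward in transitions((2, 2), 'UP'):
    print(p, next_state, reward)

# The corresponding one-step backup against some current value estimate:
V_est = np.zeros(GRID.shape)
backup = sum(p * (rew + GAMMA * V_est[ns]) for p, ns, rew in transitions((2, 2), 'UP'))
```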
Convergence criterion: \[\|V_{k+1} - V_k\|_\infty = \max_{s \in \mathcal{S}} |V_{k+1}(s) - V_k(s)| < \epsilon\]
Use \(\epsilon = 10^{-4}\).
Tasks
You may use the provided starter code. Use discount factor \(\gamma = 0.95\).
- Implement the gridworld environment in Python using only NumPy and the Python standard library
- Implement Value Iteration to compute the optimal state-value function using the Bellman optimality equation
- Implement Q-Iteration to compute the optimal state-action value function
- Derive the optimal policy from the computed value functions
- Visualize the resulting optimal value function and policy on the grid. Display the value of each cell using a heatmap, and overlay arrows indicating the optimal action in each cell
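One possible way to produce the requested plots is sketched below. It assumes matplotlib is available for visualization (the NumPy-only restriction above concerns the environment implementation); `plot_value_and_policy` and `ARROW` are names introduced here for illustration, not part of the starter code.

```python
import matplotlib.pyplot as plt

# Arrow directions as (dx, dy) in plot coordinates (y pointing up on screen).
ARROW = {'UP': (0, 1), 'DOWN': (0, -1), 'LEFT': (-1, 0), 'RIGHT': (1, 0)}

def plot_value_and_policy(V, policy):
    """Heatmap of V with the greedy action drawn as an arrow in each free cell."""
    fig, ax = plt.subplots(figsize=(5, 5))
    im = ax.imshow(V, cmap='viridis')
    fig.colorbar(im, ax=ax, label='V*(s)')
    for r in range(N_ROWS):
        for c in range(N_COLS):
            if GRID[r, c] in ('X', 'G'):        # no arrow on obstacles or the goal
                continue
            dx, dy = ARROW[policy[r, c]]
            # imshow puts row 0 at the top, so a positive dy must point upward on screen.
            ax.arrow(c, r, 0.3 * dx, -0.3 * dy, head_width=0.12, color='white')
    ax.set_xticks(range(N_COLS))
    ax.set_yticks(range(N_ROWS))
    ax.set_title('Optimal value function and greedy policy')
    plt.show()

# Example usage with the sketches above:
# V, n_vi = value_iteration()
# Q, n_qi = q_iteration()
# plot_value_and_policy(V, greedy_policy(Q))
```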
Report Requirements
Your report for Problem 1 should include:
- Number of iterations until convergence for both algorithms
- Visualization of the final value function (heatmap)
- Visualization of the optimal policy (arrows on grid)
- Brief comparison of Value Iteration vs Q-Iteration convergence rates
- Discussion of how the stochastic transitions affect the optimal policy