Multi-Agent Coordination with Limited Information

In multi-agent systems with partial observability, agents must coordinate using limited information. This problem examines which information channels enable coordination by systematically ablating the available signals.

Problem Setting

Two agents navigate a 10×10 gridworld with partial observability, attempting to simultaneously occupy a target location. Each agent observes only a 3×3 local patch centered on its position. Additional information channels may include distance to target and inter-agent communication.

Environment:

  • Agents A and B: Must coordinate to reach the target simultaneously
  • Target (T): Shared goal location
  • Obstacles (X): Impassable cells
  • Grid size: 10×10
  • Episode limit: 50 steps maximum

Observation and Action Spaces

Scenario A (Base): Independent Agents

Each agent receives a 9-dimensional observation vector:

  • Elements 0-8: Flattened 3×3 grid patch (row-major order)

    • Cell values: 0 (free), 1 (obstacle), 2 (target), -1 (off-grid)
    • Agent positions are not explicitly marked in the observation

In this scenario, agents operate independently with only local observations.
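
For illustration only, here is a minimal sketch of how such a patch observation could be built, assuming the grid is stored as a 2D NumPy array using the cell encoding above; the helper name local_patch is hypothetical and not part of the required interface:

import numpy as np

def local_patch(grid, pos, window=3):
    """Flatten the window x window patch centered on pos (row, col).

    Assumes grid cells use 0 = free, 1 = obstacle, 2 = target; cells outside
    the grid are reported as -1, and agent positions are not marked.
    """
    half = window // 2
    H, W = grid.shape
    patch = np.full((window, window), -1.0)   # off-grid cells default to -1
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = pos[0] + dr, pos[1] + dc
            if 0 <= r < H and 0 <= c < W:
                patch[dr + half, dc + half] = grid[r, c]
    return patch.flatten()                    # row-major order, 9 values for window=3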

Scenario B: Communication Channel

Observation extends to 10 dimensions:

  • Elements 0-8: Same as base scenario
  • Element 9: Communication scalar from partner agent (bounded [0,1])

Agents can now share information through a learned communication protocol.

Scenario C: Distance and Communication

Observation extends to 11 dimensions:

  • Elements 0-8: Same as base scenario
  • Element 9: Communication scalar from partner agent
  • Element 10: Normalized distance between agents

Distance computation: \[d = \frac{\sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}}{\sqrt{H^2 + W^2}}\]

where \((x_A, y_A)\) and \((x_B, y_B)\) are the positions of agents A and B, and \(H, W\) are grid dimensions. This distance represents signal strength in a beacon-like communication system. Normalization ensures the distance is bounded in [0,1] regardless of grid size, maintaining consistent input scale across different environments.
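
A minimal sketch of this computation, assuming agent positions are (row, column) tuples; the helper name normalized_distance is illustrative:

import numpy as np

def normalized_distance(pos_a, pos_b, grid_size=(10, 10)):
    """Euclidean distance between the agents divided by the grid diagonal,
    so the result is bounded in [0, 1]."""
    d = np.linalg.norm(np.asarray(pos_a, dtype=float) - np.asarray(pos_b, dtype=float))
    H, W = grid_size
    return d / np.sqrt(H**2 + W**2)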

The action space combines movement and communication:

  • Movement actions: {Up, Down, Left, Right, Stay}
  • Communication output: Bounded scalar \(c \in [0, 1]\)

Deep Q-Network Architecture

The agent network processes partial observations to produce action-values and communication signals:

\[\begin{aligned} h &= \text{ReLU}(W_1 \cdot [\text{obs}, \text{comm}_{in}] + b_1) \\ Q(s,a) &= W_{action} \cdot h + b_{action} \quad \in \mathbb{R}^5 \\ c_{out} &= \sigma(W_{comm} \cdot h + b_{comm}) \quad \in [0, 1] \end{aligned}\]
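
One way to realize these equations in PyTorch, consistent with the AgentDQN interface specified at the end of this section (a sketch, not the required implementation; here the 11-dimensional input is taken to already contain the partner's communication signal as element 9):

import torch
import torch.nn as nn

class AgentDQN(nn.Module):
    """Single hidden layer with dual heads: Q-values and a communication scalar."""
    def __init__(self, input_dim=11, hidden_dim=64, num_actions=5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)              # W_1, b_1
        self.action_head = nn.Linear(hidden_dim, num_actions)    # W_action, b_action
        self.comm_head = nn.Linear(hidden_dim, 1)                # W_comm, b_comm

    def forward(self, x):
        h = torch.relu(self.fc1(x))                     # h = ReLU(W_1 [obs, comm_in] + b_1)
        action_values = self.action_head(h)             # Q(s, a), one value per action
        comm_signal = torch.sigmoid(self.comm_head(h))  # c_out in [0, 1]
        return action_values, comm_signal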

Multi-Agent Q-Learning

The agents learn through independent Q-learning with shared experiences. The temporal difference target for agent \(i\):

\[y_i = r + \gamma \max_{a'} Q_i(s', a'; \theta^-_i)\]

where \(\theta^-_i\) denotes agent \(i\)'s target network parameters, updated periodically for stability. The loss combines both agents’ TD errors:

\[\mathcal{L} = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ (Q_A(s_A, a_A) - y_A)^2 + (Q_B(s_B, a_B) - y_B)^2 \right]\]

Experience replay \(\mathcal{D}\) stores tuples \((s_A, s_B, a_A, a_B, c_A, c_B, r, s'_A, s'_B)\) capturing joint experiences.
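
Below is a sketch of the replay buffer and the per-agent TD loss under this scheme; the class and function names, tensor shapes, and done-mask handling are assumptions rather than required interfaces:

import random
from collections import deque

import torch
import torch.nn.functional as F

GAMMA = 0.99

class ReplayBuffer:
    """Stores joint transitions (s_A, s_B, a_A, a_B, c_A, c_B, r, s'_A, s'_B)."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def td_loss(online_net, target_net, obs, actions, rewards, next_obs, dones):
    """Per-agent TD loss; the total loss sums this term over agents A and B."""
    q_all, _ = online_net(obs)                                 # [batch, num_actions]
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next, _ = target_net(next_obs)
        y = rewards + GAMMA * (1.0 - dones) * q_next.max(dim=1).values
    return F.mse_loss(q_taken, y)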

Reward Structure

The reward function is:

\[R(s,a) = \begin{cases} +10 & \text{if both agents on target} \\ +2 & \text{if one agent on target} \\ -0.1 & \text{step penalty} \end{cases}\]

The partial reward for single-agent arrival provides intermediate feedback to guide exploration, while the full reward requires coordination.

Agents must:

  1. Discover the target location through exploration
  2. Coordinate arrival timing
  3. Learn which actions and communications led to success

Tasks

You may use the provided starter code, but must follow the interfaces exactly. Use discount factor γ = 0.99.

  1. Implement the multi-agent environment following the MultiAgentEnv interface specification below

  2. Implement the DQN architecture with dual outputs (Q-values and communication signal) following the AgentDQN interface

  3. Implement experience replay with batched updates

  4. Implement \(\epsilon\)-greedy exploration for action selection

  5. Conduct an ablation study using a unified 11-dimensional observation space and a single network architecture. Control information availability through input masking (see the sketch after this list):

    • (a) Independent: Mask elements 9-10 with zeros
    • (b) Communication Only: Mask element 10 with zero
    • (c) Full Information: No masking

  6. Train all three configurations and save models in TorchScript format
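
One way to implement the input masking for Task 5 (a sketch; the configuration names and helper are illustrative, and observations are assumed to be tensors of shape [batch, 11]):

import torch

def mask_observation(obs, config):
    """Zero out information channels according to the ablation configuration.

    config: 'independent' masks elements 9-10, 'comm_only' masks element 10,
    and 'full' leaves the observation unchanged.
    """
    obs = obs.clone()
    if config == "independent":
        obs[:, 9:11] = 0.0
    elif config == "comm_only":
        obs[:, 10] = 0.0
    return obs

# One option for saving a trained model in TorchScript format (Task 6):
#   torch.jit.script(model).save("agent_full.pt")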

Report Requirements

Your report for Problem 2 should include:

  • Training hyperparameters (learning rate, batch size, epsilon schedule, replay buffer size)
  • Training curves for all three configurations showing average reward and success rate
  • Final success rates for each configuration
  • Comparison table of performance across configurations
  • Analysis of how distance information and communication affect coordination
  • Discussion of learned strategies in each configuration

Required Interfaces

The following interfaces must be implemented exactly as specified for autograding:

Environment Interface

class MultiAgentEnv:
    def __init__(self, grid_size=(10, 10), obs_window=3, max_steps=50):
        """
        Initialize the environment.

        Parameters:
            grid_size: Tuple defining grid dimensions (default 10x10)
            obs_window: Size of local observation window (must be odd, default 3)
            max_steps: Maximum steps per episode
        """

    def reset(self):
        """
        Reset environment to initial state.

        Returns:
            obs_A, obs_B: Tuple[np.ndarray, np.ndarray]
            Each observation is an 11-dimensional vector:
            - Elements 0-8: Flattened 3x3 grid patch (row-major order)
            - Element 9: Communication signal from partner
            - Element 10: Normalized L2 distance between agents
        """

    def step(self, action_A, action_B, comm_A, comm_B):
        """
        Take a step in the environment.

        Parameters:
            action_A, action_B: Discrete actions (0:Up, 1:Down, 2:Left, 3:Right, 4:Stay)
            comm_A, comm_B: Communication scalars from each agent

        Returns:
            (obs_A, obs_B), reward, done
            - reward: +10 if both agents at target, +2 if one agent at target, -0.1 per step
            - done: True if both agents at target or max steps reached
        """

Network Interface

class AgentDQN(nn.Module):
    def __init__(self, input_dim=11, hidden_dim=64, num_actions=5):
        """
        Initialize the DQN.

        Parameters:
            input_dim: Dimension of input vector (default 11)
            hidden_dim: Number of hidden units (default 64)
            num_actions: Number of discrete actions (default 5)
        """

    def forward(self, x):
        """
        Forward pass.

        Parameters:
            x: Tensor of shape [batch_size, input_dim]

        Returns:
            action_values: Tensor of shape [batch_size, num_actions]
            comm_signal: Tensor of shape [batch_size, 1]
        """