Reinforcement Learning Grid World Demo

How it works:

This demo shows a simple reinforcement learning (RL) agent learning to navigate a 5x5 grid world. The agent starts at the top-left corner (0,0) and aims to reach the goal at the bottom-right corner (4,4).

The agent learns using Q-learning, a type of RL algorithm. Each cell shows the Q-values for moving in each direction (up, right, down, left). Higher Q-values indicate more promising actions.
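One way to picture the per-cell Q-values described above is as a 5×5 table with four entries per cell, one per direction. The sketch below is only an illustration of that layout; the demo's internal representation (and its action ordering) may differ.

```python
# A 5x5 grid where each cell holds one Q-value per action.
# Action order (up, right, down, left) is an assumption for illustration.
N = 5
ACTIONS = ["up", "right", "down", "left"]

# Initialize all Q-values to zero: q_table[row][col][action_index]
q_table = [[[0.0] * len(ACTIONS) for _ in range(N)] for _ in range(N)]

def best_action(q_table, row, col):
    """Return the name of the highest-valued action in a cell."""
    values = q_table[row][col]
    return ACTIONS[values.index(max(values))]
```

A greedy agent would call `best_action` at each state; ties break toward the first action in the list.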

Controls:

Grid Legend:

How Q-values are updated:

Q-values are updated using the Q-learning algorithm:

Q(s, a) = Q(s, a) + α * [R + γ * max(Q(s', a')) - Q(s, a)]

Where:

- Q(s, a) — the current estimate of the value of taking action a in state s
- α — the learning rate, controlling how strongly new information overrides the old estimate
- R — the reward received after taking action a
- γ — the discount factor, weighting future rewards against immediate ones
- s' — the next state, so max(Q(s', a')) is the value of the best action available from there

This update happens after each action the agent takes, gradually improving its policy.
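The update rule above translates directly into code. In this minimal sketch, states are keys into a dictionary of four-element Q-value lists; the default α and γ values are illustrative, not the demo's actual settings.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step:
    Q(s, a) += alpha * (R + gamma * max(Q(s', a')) - Q(s, a))
    q maps each state to a list of Q-values, one per action index.
    """
    best_next = max(q[next_state])          # max over a' of Q(s', a')
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
```

For example, if the next state already has a Q-value of 1.0 for some action, an update with reward 0 moves Q(s, a) toward γ × 1.0, scaled by α.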

Reward Structure:

In this demo, the reward is determined as follows:

This reward structure encourages the agent to reach the goal in as few steps as possible, since every unnecessary step delays the goal reward.

Note: The reward is not based on the distance to the goal. This simple structure is common in introductory RL problems.
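A reward function matching this description could be sketched as follows. The specific numbers (+1.0 at the goal, −0.01 per step) are assumptions chosen to illustrate the pattern of a positive goal reward plus a small per-step penalty, not the demo's actual values.

```python
GOAL = (4, 4)  # bottom-right corner of the 5x5 grid

def reward(state, step_penalty=-0.01, goal_reward=1.0):
    """Return the reward for arriving in `state`.
    goal_reward and step_penalty are illustrative values, not taken from the demo.
    The reward depends only on whether the goal is reached, not on distance to it.
    """
    return goal_reward if state == GOAL else step_penalty
```

Because every non-goal step costs a little, shorter paths accumulate less penalty, which is what drives the agent toward efficient routes.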