Reinforcement Learning Grid World Demo

How it works:

This demo shows a simple reinforcement learning (RL) agent learning to navigate a 5x5 grid world. The agent starts at the top-left corner (0,0) and aims to reach the goal at the bottom-right corner (4,4).

The agent learns using Q-learning, a type of RL algorithm. Each cell shows the Q-values for moving in each direction (up, right, down, left). Higher Q-values indicate more promising actions.
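One way to picture the per-cell Q-values described above is as a 5×5 table with four entries per cell, one per direction. The sketch below is only an illustration of that layout; the demo's internal representation (and its action ordering) may differ.

```python
# A 5x5 grid where each cell holds one Q-value per action.
# Action order (up, right, down, left) is an assumption for illustration.
N = 5
ACTIONS = ["up", "right", "down", "left"]

# Initialize all Q-values to zero: q_table[row][col][action_index]
q_table = [[[0.0] * len(ACTIONS) for _ in range(N)] for _ in range(N)]

def best_action(q_table, row, col):
    """Return the name of the highest-valued action in a cell."""
    values = q_table[row][col]
    return ACTIONS[values.index(max(values))]
```

A greedy agent would call `best_action` at each state; ties break toward the first action in the list.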

Controls:

Grid Legend:

How Q-values are updated:

Q-values are updated using the Q-learning algorithm:

Q(s, a) = Q(s, a) + α * [R + γ * max(Q(s', a')) - Q(s, a)]

Where:

- Q(s, a) — the current estimate of the value of taking action a in state s
- α — the learning rate, controlling how strongly new information overrides the old estimate
- R — the reward received after taking action a
- γ — the discount factor, weighting future rewards against immediate ones
- s' — the next state, so max(Q(s', a')) is the value of the best action available from there

This update happens after each action the agent takes, gradually improving its policy.
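The update rule above translates directly into code. In this minimal sketch, states are keys into a dictionary of four-element Q-value lists; the default α and γ values are illustrative, not the demo's actual settings.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step:
    Q(s, a) += alpha * (R + gamma * max(Q(s', a')) - Q(s, a))
    q maps each state to a list of Q-values, one per action index.
    """
    best_next = max(q[next_state])          # max over a' of Q(s', a')
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
```

For example, if the next state already has a Q-value of 1.0 for some action, an update with reward 0 moves Q(s, a) toward γ × 1.0, scaled by α.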

Reward Structure:

In this demo, the reward is determined as follows:

This reward structure encourages the agent to reach the goal in as few steps as possible, since every unnecessary step delays the goal reward.

Note: The reward is not based on the distance to the goal. This simple structure is common in introductory RL problems.
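A reward function matching this description could be sketched as follows. The specific numbers (+1.0 at the goal, −0.01 per step) are assumptions chosen to illustrate the pattern of a positive goal reward plus a small per-step penalty, not the demo's actual values.

```python
GOAL = (4, 4)  # bottom-right corner of the 5x5 grid

def reward(state, step_penalty=-0.01, goal_reward=1.0):
    """Return the reward for arriving in `state`.
    goal_reward and step_penalty are illustrative values, not taken from the demo.
    The reward depends only on whether the goal is reached, not on distance to it.
    """
    return goal_reward if state == GOAL else step_penalty
```

Because every non-goal step costs a little, shorter paths accumulate less penalty, which is what drives the agent toward efficient routes.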