This demo shows a simple reinforcement learning (RL) agent learning to navigate a 5x5 grid world. The agent starts at the top-left corner (0,0) and aims to reach the goal at the bottom-right corner (4,4).
The agent learns using Q-learning, a type of RL algorithm. Each cell shows the Q-values for moving in each direction (up, right, down, left). Higher Q-values indicate more promising actions.
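As a rough sketch of the underlying data structure (assuming a NumPy-backed table; the demo's actual implementation isn't shown here), the Q-values can be stored as a 5×5×4 array with one entry per (row, column, action):

```python
import numpy as np

# One Q-value per (row, col, action); actions indexed 0=up, 1=right,
# 2=down, 3=left to match the order shown in each cell.
Q = np.zeros((5, 5, 4))

def greedy_action(state):
    """Return the currently most promising action for a cell."""
    row, col = state
    return int(np.argmax(Q[row, col]))
```

All values start at zero, so every action initially looks equally promising; the updates below are what differentiate them over time.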
Q-values are updated using the Q-learning algorithm:
Q(s, a) ← Q(s, a) + α * [R + γ * max_a' Q(s', a') − Q(s, a)]
Where:
- Q(s, a) is the current estimate of the value of taking action a in state s
- α (alpha) is the learning rate, which controls how strongly each update shifts the old estimate
- R is the reward received for taking the action
- γ (gamma) is the discount factor, which weights future rewards relative to immediate ones
- s' is the next state, and max_a' Q(s', a') is the value of the best action available from it
This update happens after each action the agent takes, gradually improving its policy.
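Translated into code, one update might look like the minimal sketch below (ALPHA, GAMMA, and the terminal-state handling are illustrative assumptions, not values taken from the demo):

```python
import numpy as np

ALPHA = 0.1   # learning rate (illustrative value, not from the demo)
GAMMA = 0.9   # discount factor (illustrative value, not from the demo)

def q_update(Q, state, action, reward, next_state, done):
    """Apply one Q-learning update after taking `action` in `state`."""
    r, c = state
    nr, nc = next_state
    # Future value is zero once the episode ends at the goal.
    best_future = 0.0 if done else Q[nr, nc].max()
    # TD target: immediate reward plus discounted best future value.
    target = reward + GAMMA * best_future
    # Nudge the old estimate a fraction ALPHA toward the target.
    Q[r, c, action] += ALPHA * (target - Q[r, c, action])
```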
In this demo, the reward is determined as follows:
- Reaching the goal yields a positive reward.
- Every other step incurs a small penalty.
This reward structure encourages the agent to reach the goal as quickly as possible while minimizing unnecessary steps.
Note: The reward is not based on the distance to the goal. This simple structure is common in introductory RL problems.
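A reward function matching this description might look like the sketch below; the specific constants (+10 at the goal, −0.1 per step) and the GOAL name are assumptions for illustration, not the demo's actual values:

```python
GOAL = (4, 4)

def reward_for(next_state):
    # Sparse reward: a positive payoff only at the goal, plus a small
    # per-step penalty everywhere else to discourage wandering.
    # These numeric values are assumptions, not the demo's constants.
    return 10.0 if next_state == GOAL else -0.1
```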