Maze Navigation: From Classical Algorithms to Reinforcement Learning

Maze navigation is a fundamental problem in computer science and artificial intelligence. This example demonstrates the progression from classical graph search algorithms (BFS, DFS, Dijkstra) to reinforcement learning, and explains why RL is necessary when the environment is unknown or dynamic.

Legend (for the interactive visualization below): 🟩 Start · 🟥 Goal · ⬛ Wall · 🔵 Path · 🔷 Visited (Search Area)

Classical Algorithms: When You Know Everything

Depth-First Search (DFS)

DFS explores as far as possible along each branch before backtracking. It typically finds a path quickly, but not necessarily the shortest one.

Visualization:

  • Notice how DFS plunges deep into one path, hits a dead end, and backtracks.
  • It visits cells in a specific, winding order.

Time Complexity: $O(V + E)$

Key Assumption: You have complete knowledge of the maze structure.
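
To make the exploration order concrete, here is a minimal iterative DFS sketch in Python. It assumes the maze is a 2D list with 0 for free cells and 1 for walls, and that cells are (row, col) tuples; the function name `dfs_path` and this encoding are illustrative, not the demo's actual implementation.

```python
def dfs_path(maze, start, goal):
    """Iterative DFS on a grid maze (0 = free cell, 1 = wall).

    Returns one start-to-goal path as a list of (row, col) cells --
    not necessarily the shortest -- or None if the goal is unreachable.
    """
    rows, cols = len(maze), len(maze[0])
    stack = [start]
    parent = {start: None}  # doubles as the visited set

    while stack:
        cell = stack.pop()
        if cell == goal:
            # Walk the parent chain back to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = (r + dr, c + dc)
            if (0 <= nbr[0] < rows and 0 <= nbr[1] < cols
                    and maze[nbr[0]][nbr[1]] == 0 and nbr not in parent):
                parent[nbr] = cell
                stack.append(nbr)

    return None  # goal not reachable from start
```

Using an explicit stack instead of recursion avoids Python's recursion limit on large mazes; the last-in, first-out order is what produces the deep, winding exploration seen in the visualization.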

Dijkstra's Algorithm

Dijkstra's algorithm finds the shortest path in a weighted graph. In an unweighted maze like this, every step costs the same, so it expands uniformly like a "wavefront" from the start (behaving just like BFS).

Visualization:

  • Notice how Dijkstra explores all nearby cells first before moving further away.
  • It guarantees finding the shortest path (fewest steps).

Time Complexity: $O((V + E) \log V)$

Key Assumption: You have complete knowledge of the maze structure.
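
Below is a matching sketch of Dijkstra's algorithm on the same grid encoding, using Python's `heapq` as the priority queue; with unit step costs the frontier expands as the uniform wavefront described above. The encoding and names are carried over from the DFS sketch and are assumptions, not the demo's code.

```python
import heapq

def dijkstra_path(maze, start, goal):
    """Dijkstra's algorithm on a grid maze (0 = free cell, 1 = wall).

    With unit step costs it returns (distance, path) for a shortest
    path, or (inf, None) if the goal is unreachable.
    """
    rows, cols = len(maze), len(maze[0])
    dist = {start: 0}
    parent = {start: None}
    frontier = [(0, start)]  # min-heap of (distance so far, cell)

    while frontier:
        d, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return d, path[::-1]
        if d > dist[cell]:
            continue  # stale heap entry: a shorter route was already found
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = (r + dr, c + dc)
            if (0 <= nbr[0] < rows and 0 <= nbr[1] < cols
                    and maze[nbr[0]][nbr[1]] == 0):
                nd = d + 1  # every move costs 1 in an unweighted maze
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    parent[nbr] = cell
                    heapq.heappush(frontier, (nd, nbr))

    return float("inf"), None
```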

Why Reinforcement Learning?

Classical algorithms work perfectly when you have full knowledge of the environment. However, in many real-world scenarios, this assumption breaks down:

1. Unknown Environment

Problem: You don't know the maze layout beforehand.

  • BFS/DFS/Dijkstra: Require the full graph structure to work
  • RL Solution: Agent explores and learns the environment through trial and error

Example: A robot navigating an unknown building, a game agent in a procedurally generated world.

2. Stochastic Transitions

Problem: Actions don't always work as expected (e.g., 80% chance of moving forward, 20% chance of slipping).

  • BFS/DFS/Dijkstra: Assume deterministic transitions
  • RL Solution: Learns transition probabilities and optimizes expected reward

Example: Robot with imperfect motors, game with random events.
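
As a toy illustration (not part of the demo above), the step function below implements exactly this kind of slippery transition: the intended move succeeds with probability 1 − `slip_prob`, otherwise the agent moves in a random other direction. The action names and the 20% default are assumptions chosen to mirror the example.

```python
import random

# Unit moves on the grid, keyed by action name.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def slippery_step(state, action, slip_prob=0.2):
    """Execute the intended move with probability 1 - slip_prob;
    otherwise 'slip' and move in a random other direction.

    Wall and boundary checks are omitted for brevity. A planner that
    assumes deterministic moves cannot model this, while a Q-learning
    agent simply averages over the outcomes it experiences.
    """
    if random.random() < slip_prob:
        action = random.choice([a for a in ACTIONS if a != action])
    dr, dc = ACTIONS[action]
    return (state[0] + dr, state[1] + dc)
```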

3. Dynamic Environments

Problem: The maze changes over time (walls appear/disappear, rewards change).

  • BFS/DFS/Dijkstra: Require recomputation from scratch when the environment changes
  • RL Solution: Can adapt online as the environment changes

Example: Traffic navigation with changing road conditions, dynamic obstacle avoidance.

4. Reward-Based Learning

Problem: You want to optimize for a reward signal, not just reach a goal.

  • BFS/DFS/Dijkstra: Optimize path length or cost
  • RL Solution: Learns to maximize cumulative reward (which may include penalties for long paths, energy consumption, etc.)

Example: Game AI that needs to collect coins while avoiding enemies.

5. Partial Observability

Problem: You can only see part of the maze (e.g., only adjacent cells).

  • BFS/DFS/Dijkstra: Require full observability
  • RL Solution: Can work with partial observations using state estimation

Example: Real-world navigation with limited sensors.

Q-Learning: Model-Free Reinforcement Learning

Q-Learning is a model-free RL algorithm that learns the optimal action-value function $Q(s, a)$ without needing to know the transition probabilities.

Key Concepts

  • Q-Value: $Q(s, a)$ = expected cumulative reward from state $s$ taking action $a$, then following the optimal policy
  • Bellman Equation: $Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$
  • Update Rule: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (see the sketch below)

where:

  • $\alpha$ = learning rate
  • $\gamma$ = discount factor (how much we value future rewards)
  • $R$ = immediate reward
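
A minimal sketch of this update in Python, assuming a tabular Q-function stored as a dictionary keyed by `(state, action)`; the default values of `alpha` and `gamma` are illustrative.

```python
from collections import defaultdict

# Q-table: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)]

    (For a terminal next_state the bootstrap term would normally be
    dropped; that case is omitted here for brevity.)
    """
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next      # R + gamma * max_a' Q(s', a')
    td_error = td_target - Q[(state, action)]   # how wrong the current estimate is
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]
```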

Exploration vs. Exploitation

  • Exploration: Try random actions to discover the environment (high $\epsilon$)
  • Exploitation: Use learned knowledge to take the best action (low $\epsilon$)
  • Epsilon-Greedy: With probability $\epsilon$, explore; otherwise, exploit

Initially, the agent explores heavily ($\epsilon \approx 1$). As it learns, it exploits more ($\epsilon \to 0$).
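
Here is a minimal sketch of epsilon-greedy selection together with a simple decay schedule, reusing the `defaultdict` Q-table from the previous sketch; the decay constants are illustrative assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Simple exponential decay: start near 1 (explore heavily) and
    settle toward a small floor as training progresses."""
    return max(eps_min, eps_start * decay ** episode)
```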

Interactive Simulation

Try the different algorithms below. Notice how:

  1. DFS/Dijkstra visualize the search process, assuming full knowledge of the maze.
  2. Q-Learning starts in a "Fog of War". The agent (orange dot) must explore the maze step-by-step to learn the layout and find the goal.
  3. The Q-Learning agent initially wanders randomly (exploring), then gradually converges to the optimal path as it learns (exploiting).

Comparison Table

| Aspect | BFS/DFS/Dijkstra | Q-Learning (RL) |
| --- | --- | --- |
| Prior Knowledge | Requires full maze map | None required |
| Stochasticity | Assumes deterministic | Handles stochastic transitions |
| Adaptability | Must recompute on changes | Adapts online |
| Optimization | Path length/cost | Cumulative reward |
| Partial Observability | Requires full observability | Can work with partial info |
| Learning | No learning (just search) | Learns from experience |
| Use Case | Known, static environments | Unknown, dynamic environments |

When to Use Each Approach

Use Classical Algorithms (BFS/DFS/Dijkstra) when:

  • You have complete knowledge of the environment
  • The environment is static
  • Transitions are deterministic
  • You need a solution immediately

Use Reinforcement Learning when:

  • The environment is unknown or partially observable
  • The environment is dynamic or stochastic
  • You want to optimize for complex reward signals
  • You can afford to learn through trial and error

Real-World Applications

  • Classical Algorithms: GPS navigation (known road network), pathfinding in video games (static maps)
  • Reinforcement Learning: Autonomous robots (unknown terrain), game AI (opponent behavior), recommendation systems (changing user preferences), autonomous vehicles (dynamic traffic)

Key Takeaway

Classical algorithms are planning methods: given a model of the environment, find the optimal path. Reinforcement learning is a learning method: the agent learns how to act directly from interaction, without being given a model. When the environment is unknown, stochastic, or dynamic, RL becomes essential.