Reinforcement Learning: The Inverted Pendulum
The Inverted Pendulum (or CartPole) is a classic problem in control theory and reinforcement learning. The goal is to balance a pole on a moving cart by applying forces to the cart (left or right).
Controls: Toggle "Fast Forward" to speed up the learning process. The simulation stops automatically after 500 failures or when convergence is reached.
The Problem
- State: The system is defined by 4 continuous variables:
- x: Position of the cart
- ẋ (x_dot): Velocity of the cart
- θ (theta): Angle of the pole (0 is upright)
- θ̇ (theta_dot): Angular velocity of the pole
- Actions: The agent can push the cart Left (0) or Right (1).
- Reward: The agent receives a reward of 0 for every step the pole remains upright, and -1 when the pole falls (or the cart moves out of bounds), as sketched after this list.
- Goal: Maximize the time the pole stays upright.
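As a minimal sketch of this setup (the names and structure below are illustrative assumptions, not taken from the simulation's source), the problem maps to a small state object and reward function:

// Illustrative problem representation (names are assumptions)
const ACTIONS = { LEFT: 0, RIGHT: 1 };

// A state consists of four continuous values
let state = { x: 0.0, x_dot: 0.0, theta: 0.0, theta_dot: 0.0 };

// Reward: 0 while balanced, -1 on failure (pole falls or cart leaves the track)
function reward(failed) {
  return failed ? -1 : 0;
}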
The Solution: Model-Based Reinforcement Learning
This simulation uses a Model-Based RL approach with Value Iteration.
- Discretization: The continuous state space is discretized into 163 distinct states to make the problem tractable for tabular RL.
- Model Learning: The agent learns the transition probabilities P(s' | s, a) and the reward function R(s) by observing the results of its actions (see the sketch after this list).
- Planning (Value Iteration): After each failure (episode), the agent solves the Bellman optimality equation to find the optimal Value Function V*:
  V*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) V*(s')
- Policy: The agent chooses the action that maximizes the expected future value (greedy policy with respect to V*).
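A minimal sketch of the model-learning step, assuming the model is stored as transition counts and per-state reward statistics (all identifiers below are illustrative, not the simulation's actual ones):

// Update the learned model after observing (s, a) -> s' with a reward (illustrative)
function update_model(model, s, a, s_next, observed_reward) {
  model.counts[s][a][s_next] += 1;              // transition count N(s, a, s')
  model.reward_sum[s_next] += observed_reward;  // accumulate reward seen in s'
  model.reward_count[s_next] += 1;
}

// Estimated transition probability P(s' | s, a)
function transition_prob(model, s, a, s_next) {
  const total = model.counts[s][a].reduce((sum, n) => sum + n, 0);
  return total > 0 ? model.counts[s][a][s_next] / total : 1 / model.counts[s][a].length;
}

// Estimated reward R(s)
function estimated_reward(model, s) {
  return model.reward_count[s] > 0 ? model.reward_sum[s] / model.reward_count[s] : 0;
}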
Simulation Explanation
The real-time simulation above demonstrates the learning process. The agent starts with no knowledge (random actions). As it fails, it updates its model and improves its policy.
- Black Line: Log of the number of steps the pole stayed balanced in each trial.
- Red Dashed Line: Smoothed moving average of the black curve (a sketch of the smoothing follows this list).
- Observation: Notice how the "steps to failure" increases as the agent learns (the curve goes up).
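For reference, the smoothing can be done with a simple moving average like the one below (the window handling and names are illustrative, not taken from the simulation's code):

// Simple moving average over the per-trial step counts (illustrative)
function moving_average(values, window) {
  return values.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = values.slice(start, i + 1);
    return slice.reduce((sum, v) => sum + v, 0) / slice.length;
  });
}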
Key Concepts Demonstrated
- Exploration vs. Exploitation: Initially, the agent acts with little knowledge and fails often, which effectively explores the state space. As it builds a better model, it exploits that knowledge to balance the pole longer.
- Delayed Reward: The agent learns that certain actions (like moving too fast towards the edge) eventually lead to a fall (negative reward), even if the immediate state seems fine.
- Convergence: After sufficient trials, the value function converges, and the agent can balance the pole indefinitely (or until the simulation limit).
Implementation Details
The simulation runs entirely in your browser using JavaScript. The physics engine uses Euler's method for integration, and the RL agent performs Value Iteration steps asynchronously between trials.
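To illustrate the integration step, here is a minimal sketch of a forward-Euler update using the standard cart-pole dynamics; the parameter values are common defaults from the classic formulation and may differ from the ones used in this simulation:

// Forward-Euler step for the cart-pole dynamics (parameter values are typical defaults)
const GRAVITY = 9.8, CART_MASS = 1.0, POLE_MASS = 0.1, POLE_HALF_LENGTH = 0.5;
const FORCE_MAG = 10.0, TAU = 0.02; // integration time step in seconds
const TOTAL_MASS = CART_MASS + POLE_MASS;
const POLE_MASS_LENGTH = POLE_MASS * POLE_HALF_LENGTH;

function step(state, action) {
  const force = action === 1 ? FORCE_MAG : -FORCE_MAG; // Right (1) pushes +x, Left (0) pushes -x
  const cos_t = Math.cos(state.theta);
  const sin_t = Math.sin(state.theta);

  const temp = (force + POLE_MASS_LENGTH * state.theta_dot ** 2 * sin_t) / TOTAL_MASS;
  const theta_acc = (GRAVITY * sin_t - cos_t * temp) /
    (POLE_HALF_LENGTH * (4 / 3 - POLE_MASS * cos_t ** 2 / TOTAL_MASS));
  const x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS;

  // Euler integration: advance each state variable by its derivative times TAU
  return {
    x: state.x + TAU * state.x_dot,
    x_dot: state.x_dot + TAU * x_acc,
    theta: state.theta + TAU * state.theta_dot,
    theta_dot: state.theta_dot + TAU * theta_acc,
  };
}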
State Discretization Logic
The continuous 4-dimensional state is mapped to discrete buckets. A finer discretization allows for better control but requires more data to learn.
// Example of discretization logic used in the simulation
// (twelve_deg is the ±12° pole-angle limit in radians; the wrapper function is illustrative)
const twelve_deg = 12 * Math.PI / 180;

function get_state(x, x_dot, theta, theta_dot, total_states) {
  // Failure: cart off the track (|x| > 2.4) or pole beyond ±12°
  if (x < -2.4 || x > 2.4 || theta < -twelve_deg || theta > twelve_deg) {
    return total_states - 1; // Failure state
  }
  // Map x, x_dot, theta, theta_dot to integer buckets...
  // Total 163 states.
}
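One common way to fill in the bucket mapping is the classic BOXES-style grid (3 position buckets × 3 velocity buckets × 6 angle buckets × 3 angular-velocity buckets = 162 states, plus one failure state = 163). The sketch below uses thresholds from that classic formulation; the simulation's exact cut points may differ:

// Illustrative bucket mapping for the 162 + 1 state grid (thresholds are assumptions)
function bucket(value, thresholds) {
  let b = 0;
  while (b < thresholds.length && value >= thresholds[b]) b++;
  return b;
}

function discretize(x, x_dot, theta, theta_dot) {
  const deg = Math.PI / 180;
  const bx  = bucket(x,         [-0.8, 0.8]);                               // 3 buckets
  const bxd = bucket(x_dot,     [-0.5, 0.5]);                               // 3 buckets
  const bt  = bucket(theta,     [-6 * deg, -1 * deg, 0, 1 * deg, 6 * deg]); // 6 buckets
  const btd = bucket(theta_dot, [-50 * deg, 50 * deg]);                     // 3 buckets
  return ((bx * 3 + bxd) * 6 + bt) * 3 + btd; // indices 0..161; index 162 is the failure state
}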
Value Iteration
function update_mdp_value(mdp_data, tolerance, gamma) {
  // ... iterate until max_change < tolerance
  // For each state s, compute Q(s, a) = R(s) + gamma * sum_{s'} P(s'|s,a) * V(s')
  // for both actions, then keep the greedy maximum:
  new_value[s] = Math.max(q_values_0, q_values_1);
}
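The fragment above elides the main loop. A self-contained sketch of the same backup, assuming the learned model is stored as probabilities P[s][a][s2] and state rewards R[s] (these names describe a possible layout, not the simulation's actual one), could look like this:

// Illustrative value iteration over a learned tabular model with two actions
function value_iteration(P, R, num_states, gamma, tolerance) {
  const V = new Array(num_states).fill(0);
  let max_change = Infinity;
  while (max_change > tolerance) {
    max_change = 0;
    for (let s = 0; s < num_states; s++) {
      // Q(s, a) = R(s) + gamma * expected value of the successor state
      const q = [0, 1].map(a => {
        let expected = 0;
        for (let s2 = 0; s2 < num_states; s2++) {
          expected += P[s][a][s2] * V[s2];
        }
        return R[s] + gamma * expected;
      });
      const updated = Math.max(q[0], q[1]);
      max_change = Math.max(max_change, Math.abs(updated - V[s]));
      V[s] = updated;
    }
  }
  return V; // the greedy policy picks the action with the larger Q(s, a) in each state
}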