Q-Learning: Off-Policy TD Learning
Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of ε over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.
Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function q_*(s, a). It updates its estimates using the best available next action rather than the action the agent actually takes next, which is what makes it off-policy.
Update Rule
Unlike off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for the mismatch between the behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with one key difference.
The Q-learning update rule is:
Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl( R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr)

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

\gamma Q(S_{t+1}, A_{t+1})

Q-learning uses the value of the best possible next action:

\gamma \max_a Q(S_{t+1}, a)

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method: it learns about the greedy policy, regardless of the actions chosen during training.
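As a minimal sketch (not this lesson's reference implementation), the update can be written as a single tabular step in Python; the array Q, the step size alpha, and the discount gamma are names introduced here purely for illustration:

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to a tabular action-value array Q."""
    # Off-policy target: the value of the best next action,
    # not the action the behavior policy will actually take next.
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
```

Swapping np.max(Q[next_state]) for Q[next_state, next_action] would recover the SARSA update, which targets the action actually selected by the behavior policy.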
When to Use Q-Learning?
Q-learning is preferable when:
- You are dealing with deterministic environments, or environments with low stochasticity;
- You need faster convergence.
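For context, here is a hedged end-to-end sketch of how the update is commonly embedded in a training loop, assuming the Gymnasium API and the FrozenLake-v1 toy environment; the environment choice, episode count, and hyperparameters are illustrative assumptions, not part of this lesson.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")          # assumed toy environment with discrete states and actions
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy over the current Q estimates.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Q-learning update: the target uses the greedy (max) next-action value.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        done = terminated or truncated
```

Because the target always takes the max over next-state values, the learned Q approximates the greedy policy's values even though the agent keeps exploring with a fixed epsilon.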