Q-Learning: Off-Policy TD Learning
Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of ε over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.
Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function $q_*(s, a)$. It updates its estimates based on the best action available in the next state rather than the action the agent actually takes, which is what makes it an off-policy algorithm.
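To make the off-policy distinction concrete, the sketch below contrasts an ε-greedy behavior policy with the greedy target policy that Q-learning learns about. The function names and the layout of the `Q` table are illustrative assumptions, not part of the lesson:

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng):
    """Behavior policy: the action the agent actually takes while exploring."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploratory random action
    return int(np.argmax(Q[state]))           # greedy action under current estimates

def greedy_action(Q, state):
    """Target policy: the policy Q-learning evaluates and improves toward."""
    return int(np.argmax(Q[state]))
```

The agent behaves according to the first function, but its updates evaluate the second; that separation between behavior and target policies is the defining feature of an off-policy method.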
Update Rule
Unlike off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between the behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.
The Q-learning update rule is:
$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr)$$

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

$$\gamma Q(S_{t+1}, A_{t+1})$$

Q-learning uses the value of the best possible next action:

$$\gamma \max_a Q(S_{t+1}, a)$$

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method: it learns about the greedy policy regardless of the actions chosen during training.
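As a rough sketch, this update maps almost line for line onto code for a tabular Q-table stored as a NumPy array. The variable names below are assumptions chosen for illustration, not from the lesson:

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma, terminal):
    """Apply one Q-learning update to a tabular Q-table (2D array: states x actions)."""
    # The target bootstraps from the best next action, not the action the agent will take.
    best_next_value = 0.0 if terminal else np.max(Q[next_state])
    td_target = reward + gamma * best_next_value
    Q[state, action] += alpha * (td_target - Q[state, action])
```

Replacing `np.max(Q[next_state])` with `Q[next_state, next_action]` would turn this into the SARSA update, which is exactly the one-term difference described above.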
When to Use Q-Learning?
Q-learning is preferable when:
- You are dealing with deterministic environments, or environments with low stochasticity;
- You need faster convergence.
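Putting these pieces together, a complete tabular Q-learning loop might look like the following sketch. It assumes a discrete environment following the Gymnasium `reset()`/`step()` convention, and the hyperparameter values are placeholders rather than recommendations:

```python
import numpy as np

def train_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with a fixed epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy exploration.
            if rng.random() < epsilon:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Target policy: greedy, via the max over next-state action values.
            best_next = 0.0 if terminated else np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
            state = next_state

    return Q
```

For instance, with `import gymnasium as gym`, calling `train_q_learning(gym.make("FrozenLake-v1"))` would return a learned Q-table whose greedy policy can then be evaluated. Note that ε stays fixed here: because the target already uses the greedy action, Q-learning can learn the greedy policy without the ε decay schedule that SARSA requires.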