Generalization of TD Learning
So far, we have considered two extreme cases of learning from experience:
- TD(0): uses one-step return;
- Monte Carlo: waits until the end of the episode to compute the return.
But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?
This is where n-step TD learning and TD(λ) come in: methods that unify and generalize the ideas we've seen so far.
n-Step TD Learning
The idea behind n-step TD learning is simple: instead of using just the next step or the entire episode, we use the next n steps, then bootstrap:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

This allows for a tradeoff:
- When n=1: it's just TD(0);
- When n=∞: it becomes Monte Carlo.
This return can then be used as the target in the TD(0) update rule:
$$V(S_t) \gets V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$$
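To make the n-step update concrete, here is a minimal tabular sketch for value prediction. It assumes a Gymnasium-style environment with integer states; the `env.reset()`/`env.step()` interface, the `policy` function, and `num_states` are illustrative assumptions rather than part of the material above. Only the return computation and the update line mirror the formulas.

```python
import numpy as np

def n_step_td(env, policy, n=3, alpha=0.1, gamma=0.99, num_episodes=500, num_states=16):
    """Tabular n-step TD prediction (sketch; assumes a Gymnasium-style discrete env)."""
    V = np.zeros(num_states)

    for _ in range(num_episodes):
        state, _ = env.reset()
        states, rewards = [state], [0.0]    # rewards[t] is the reward received at time t
        T = float("inf")                    # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, terminated, truncated, _ = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if terminated or truncated:
                    T = t + 1
            tau = t - n + 1                 # time step whose value estimate is updated
            if tau >= 0:
                # n-step return: up to n discounted rewards ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                # ... plus the bootstrapped value, if the episode hasn't ended by then
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                # TD-style update toward the n-step target
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

With n=1 this reduces to TD(0), and choosing n larger than any episode length makes the target a plain Monte Carlo return.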
TD(λ)
TD(λ) is a clever idea that builds on top of n-step TD learning: instead of choosing a fixed n, we combine all n-step returns together:
$$L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where λ ∈ [0, 1] controls the weighting:
- If λ=0: only one-step return → TD(0);
- If λ=1: full return → Monte Carlo;
- Intermediate values blend multiple step returns.
So λ acts as a bias-variance tradeoff knob (see the numeric sketch after this list):
- Low λ: more bias, less variance;
- High λ: less bias, more variance.
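To see this tradeoff numerically, the snippet below prints the weights (1 − λ)λ^(n−1) that the λ-return assigns to each n-step return; the two λ values are arbitrary picks for illustration. A low λ puts almost all weight on short, low-variance (but more biased) returns, while a high λ spreads weight toward long, less biased (but noisier) returns.

```python
# Weights the λ-return assigns to the n-step returns: (1 - λ) * λ**(n - 1)
def lambda_weights(lam, max_n=10):
    return [(1 - lam) * lam ** (n - 1) for n in range(1, max_n + 1)]

for lam in (0.2, 0.9):   # arbitrary "low" and "high" values for illustration
    w = lambda_weights(lam)
    print(f"λ = {lam}: " + ", ".join(f"{x:.3f}" for x in w))

# First five weights of the output:
#   λ = 0.2: 0.800, 0.160, 0.032, 0.006, 0.001  -> mass on short returns (TD(0)-like)
#   λ = 0.9: 0.100, 0.090, 0.081, 0.073, 0.066  -> mass spread to long returns (MC-like)
```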
L_t can then be used as the update target in the TD(0) update rule:
$$V(S_t) \gets V(S_t) + \alpha \left( L_t - V(S_t) \right)$$
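Below is a minimal forward-view sketch that computes L_t from an already finished episode and applies the update above. In episodic tasks the infinite sum truncates: every n-step return with n ≥ T − t equals the full Monte Carlo return, which is why the last term gets weight λ^(T−t−1). The episode data layout (`states[k]` holds S_k, `rewards[k]` holds R_(k+1)) is an assumption for illustration; practical TD(λ) implementations usually rely on eligibility traces instead, but this version follows the formula directly.

```python
def n_step_return(V, states, rewards, t, n, gamma):
    """G_t^(n): up to n discounted rewards, then bootstrap with V (clipped at episode end)."""
    T = len(rewards)                          # episode ends after reward R_T
    steps = min(n, T - t)                     # cannot look past termination
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                             # bootstrap only if the episode continues
        G += gamma ** n * V[states[t + n]]
    return G

def lambda_return_update(V, states, rewards, t, lam=0.8, alpha=0.1, gamma=0.99):
    """One forward-view TD(λ) update of V at time t, after the episode has finished."""
    T = len(rewards)
    # L_t = (1-λ) * Σ_{n=1}^{T-t-1} λ^(n-1) * G_t^(n)  +  λ^(T-t-1) * (full return)
    L = sum((1 - lam) * lam ** (n - 1) * n_step_return(V, states, rewards, t, n, gamma)
            for n in range(1, T - t))
    L += lam ** (T - t - 1) * n_step_return(V, states, rewards, t, T - t, gamma)
    # Same shape as the TD(0) update, but with L_t as the target
    V[states[t]] += alpha * (L - V[states[t]])
```

Calling lambda_return_update for every t of each collected episode gives a forward-view TD(λ) learner; with lam=0 the target collapses to the one-step TD(0) target, and with lam=1 it collapses to the Monte Carlo return.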