Generalization of TD Learning
So far, we have considered two extreme cases of learning from experience:
- TD(0): uses one-step return;
- Monte Carlo: waits until the end of the episode to compute the return.
But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?
This is where n-step TD learning and TD(λ) come in: methods that unify and generalize the ideas we've seen so far.
n-Step TD Learning
The idea behind n-step TD learning is simple: instead of using just the next step or the entire episode, we use the next n steps, then bootstrap:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

This allows for a tradeoff:
- When n=1: it's just TD(0);
- When n=∞: it becomes Monte Carlo.
This return can then be used as the target in the TD(0) update rule:
$$V(S_t) \gets V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$$
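To make the n-step update concrete, here is a minimal tabular sketch for value prediction. It assumes a Gymnasium-style environment with integer states; the `env.reset()`/`env.step()` interface, the `policy` function, and `num_states` are illustrative assumptions rather than part of the material above. Only the return computation and the update line mirror the formulas.

```python
import numpy as np

def n_step_td(env, policy, n=3, alpha=0.1, gamma=0.99, num_episodes=500, num_states=16):
    """Tabular n-step TD prediction (sketch; assumes a Gymnasium-style discrete env)."""
    V = np.zeros(num_states)

    for _ in range(num_episodes):
        state, _ = env.reset()
        states, rewards = [state], [0.0]    # rewards[t] is the reward received at time t
        T = float("inf")                    # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, terminated, truncated, _ = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if terminated or truncated:
                    T = t + 1
            tau = t - n + 1                 # time step whose value estimate is updated
            if tau >= 0:
                # n-step return: up to n discounted rewards ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                # ... plus the bootstrapped value, if the episode hasn't ended by then
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                # TD-style update toward the n-step target
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

With n=1 this reduces to TD(0), and choosing n larger than any episode length makes the target a plain Monte Carlo return.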
TD(λ)
TD(λ) is a clever idea that builds on top of n-step TD learning: instead of choosing a fixed n, we combine all n-step returns together:
$$L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where λ ∈ [0, 1] controls the weighting:
- If λ=0: only one-step return → TD(0);
- If λ=1: full return → Monte Carlo;
- Intermediate values blend multiple step returns.
So λ acts as a bias-variance tradeoff knob (see the numeric sketch after this list):
- Low λ: more bias, less variance;
- High λ: less bias, more variance.
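To see this tradeoff numerically, the snippet below prints the weights (1 − λ)λ^(n−1) that the λ-return assigns to each n-step return; the two λ values are arbitrary picks for illustration. A low λ puts almost all weight on short, low-variance (but more biased) returns, while a high λ spreads weight toward long, less biased (but noisier) returns.

```python
# Weights the λ-return assigns to the n-step returns: (1 - λ) * λ**(n - 1)
def lambda_weights(lam, max_n=10):
    return [(1 - lam) * lam ** (n - 1) for n in range(1, max_n + 1)]

for lam in (0.2, 0.9):   # arbitrary "low" and "high" values for illustration
    w = lambda_weights(lam)
    print(f"λ = {lam}: " + ", ".join(f"{x:.3f}" for x in w))

# First five weights of the output:
#   λ = 0.2: 0.800, 0.160, 0.032, 0.006, 0.001  -> mass on short returns (TD(0)-like)
#   λ = 0.9: 0.100, 0.090, 0.081, 0.073, 0.066  -> mass spread to long returns (MC-like)
```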
L_t can then be used as the update target in the TD(0) update rule:
$$V(S_t) \gets V(S_t) + \alpha \left( L_t - V(S_t) \right)$$
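Below is a minimal forward-view sketch that computes L_t from an already finished episode and applies the update above. In episodic tasks the infinite sum truncates: every n-step return with n ≥ T − t equals the full Monte Carlo return, which is why the last term gets weight λ^(T−t−1). The episode data layout (`states[k]` holds S_k, `rewards[k]` holds R_(k+1)) is an assumption for illustration; practical TD(λ) implementations usually rely on eligibility traces instead, but this version follows the formula directly.

```python
def n_step_return(V, states, rewards, t, n, gamma):
    """G_t^(n): up to n discounted rewards, then bootstrap with V (clipped at episode end)."""
    T = len(rewards)                          # episode ends after reward R_T
    steps = min(n, T - t)                     # cannot look past termination
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                             # bootstrap only if the episode continues
        G += gamma ** n * V[states[t + n]]
    return G

def lambda_return_update(V, states, rewards, t, lam=0.8, alpha=0.1, gamma=0.99):
    """One forward-view TD(λ) update of V at time t, after the episode has finished."""
    T = len(rewards)
    # L_t = (1-λ) * Σ_{n=1}^{T-t-1} λ^(n-1) * G_t^(n)  +  λ^(T-t-1) * (full return)
    L = sum((1 - lam) * lam ** (n - 1) * n_step_return(V, states, rewards, t, n, gamma)
            for n in range(1, T - t))
    L += lam ** (T - t - 1) * n_step_return(V, states, rewards, t, T - t, gamma)
    # Same shape as the TD(0) update, but with L_t as the target
    V[states[t]] += alpha * (L - V[states[t]])
```

Calling lambda_return_update for every t of each collected episode gives a forward-view TD(λ) learner; with lam=0 the target collapses to the one-step TD(0) target, and with lam=1 it collapses to the Monte Carlo return.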