TD(0): Value Function Estimation
The simplest version of TD learning is called TD(0). It updates the value of a state based on the immediate reward and the estimated value of the next state. It is a one-step TD method.
Update Rule
Given a state $S_t$, reward $R_{t+1}$, and next state $S_{t+1}$, the update rule looks like this:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr)$$

where

- $\alpha$ is the learning rate, or step size;
- $\gamma$ is the discount factor;
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
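In code, this one-step update is a single assignment. The sketch below is a minimal illustration, assuming a tabular value function stored as a NumPy array and made-up state indices, rewards, and hyperparameters.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply a single TD(0) update to the tabular value function V."""
    td_error = r + gamma * V[s_next] - V[s]   # delta_t
    V[s] += alpha * td_error                  # move V(s) toward the TD target
    return td_error

# Toy usage: 5 states, one observed transition from state 0 to state 1 with reward 1.0.
V = np.zeros(5)
delta = td0_update(V, s=0, r=1.0, s_next=1)
```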
Intuition
The state value function $v_\pi$ can be defined and expanded as follows:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

This gives the first part of $\delta_t$: the experienced return $R_{t+1} + \gamma V(S_{t+1})$. The second part of $\delta_t$ is the expected return $V(S_t)$. The TD error $\delta_t$ is therefore the observable discrepancy between what actually happened and what we previously believed would happen. So the update rule adjusts the previous belief a little on each step, moving it closer to the truth.
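As a concrete illustration (the numbers are made up): suppose $V(S_t) = 2.0$, $R_{t+1} = 1$, $\gamma = 0.9$, $V(S_{t+1}) = 3.0$, and $\alpha = 0.1$. Then

$$\delta_t = 1 + 0.9 \cdot 3.0 - 2.0 = 1.7, \qquad V(S_t) \leftarrow 2.0 + 0.1 \cdot 1.7 = 2.17,$$

so the estimate of $V(S_t)$ is nudged upward, because the observed one-step outcome was better than the previous belief.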
TD(0) vs Monte Carlo Estimation
Both TD(0) and Monte Carlo estimation use sampled experience to estimate the state value function $v_\pi(s)$ for a policy $\pi$. Under standard convergence conditions, both converge to the true $v_\pi(s)$ as the number of visits to each state goes to infinity. In practice, however, we only ever have a finite amount of data, and the two methods differ significantly in how they use that data and how quickly they learn.
Bias-Variance Tradeoff
From a bias-variance tradeoff perspective:

Monte Carlo estimation waits until an episode ends and then uses the full return to update values. This yields unbiased estimates (the returns truly reflect the underlying distribution), but they can swing dramatically, especially in long or highly stochastic tasks. High variance means many episodes are required to average out the noise and obtain stable value estimates.

TD(0) bootstraps by combining each one-step reward with the current estimate of the next state's value. This introduces bias (early updates rely on imperfect estimates) but keeps variance low, since each update is based on a small, incremental error. Lower variance lets TD(0) propagate reward information through the state space more quickly, even though initial bias can slow down convergence.
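The difference in update targets can be made explicit in code. The sketch below contrasts the two targets for the first state of a single recorded episode; the episode data, discount factor, and value estimates are made up for illustration.

```python
import numpy as np

gamma = 0.9
# A made-up episode: rewards observed after each of the first three states.
rewards = [1.0, 0.0, 2.0]             # R_1, R_2, R_3 (episode ends after R_3)
V = np.zeros(4)                        # current value estimates for states 0..3

# Monte Carlo target for state 0: the full observed return G_0 (high variance, no bias).
mc_target = sum(gamma**k * r for k, r in enumerate(rewards))

# TD(0) target for state 0: one real reward plus a bootstrapped estimate (low variance, biased).
td_target = rewards[0] + gamma * V[1]
```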
Learning Data vs Learning Model
Another way to look at these two methods is to analyze what each of them really learns:
Monte Carlo estimation learns directly from the observed returns, effectively fitting its value estimates to the specific episodes it has seen. This means it minimizes error on those training trajectories, but because it never builds an explicit view of how states lead to one another, it can struggle to generalize to new or slightly different situations.
TD(0), by contrast, bootstraps on each one-step transition, combining the immediate reward with its estimate of the next state's value. In doing so, it effectively captures the relationships between states, forming an implicit model of the environment's dynamics. This model-like understanding lets TD(0) generalize better to unseen transitions, often yielding more accurate value estimates on new data.
Pseudocode
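The following Python sketch shows one common way to write the full TD(0) prediction loop. It assumes a tabular setting, a Gym-style environment interface (`env.reset()`, `env.step()`), and a given policy function; these interface names and hyperparameters are illustrative assumptions rather than a definitive implementation.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the state value function v_pi with TD(0)."""
    V = defaultdict(float)  # value estimates, initialized to 0 for every state

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)  # sample A_t from pi(. | S_t)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD(0) update: move V(S_t) toward R_{t+1} + gamma * V(S_{t+1}).
            # At terminal states the bootstrap term is zeroed out.
            target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (target - V[state])
            state = next_state

    return V
```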