TD(0): Value Function Estimation
The simplest version of TD learning is called TD(0). It updates the value of a state based on the immediate reward and the estimated value of the next state. It is a one-step TD method.
Update Rule
Given a state $S_t$, reward $R_{t+1}$, and next state $S_{t+1}$, the update rule looks like this:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr)$$

where

- $\alpha$ is the learning rate, or step size;
- $\gamma$ is the discount factor;
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
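In code, this one-step update is a single assignment. The sketch below is a minimal illustration, assuming a tabular value function stored as a NumPy array and made-up state indices, rewards, and hyperparameters.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply a single TD(0) update to the tabular value function V."""
    td_error = r + gamma * V[s_next] - V[s]   # delta_t
    V[s] += alpha * td_error                  # move V(s) toward the TD target
    return td_error

# Toy usage: 5 states, one observed transition from state 0 to state 1 with reward 1.0.
V = np.zeros(5)
delta = td0_update(V, s=0, r=1.0, s_next=1)
```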
Intuition
The state value function $v_\pi$ can be defined and expanded as follows:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

This gives the first part of $\delta_t$: the experienced return $R_{t+1} + \gamma V(S_{t+1})$. The second part of $\delta_t$ is the expected return $V(S_t)$. The TD error $\delta_t$ is therefore the observable discrepancy between what actually happened and what we previously believed would happen. So the update rule adjusts the previous belief a little on each step, moving it closer to the truth.
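As a concrete illustration (the numbers are made up): suppose $V(S_t) = 2.0$, $R_{t+1} = 1$, $\gamma = 0.9$, $V(S_{t+1}) = 3.0$, and $\alpha = 0.1$. Then

$$\delta_t = 1 + 0.9 \cdot 3.0 - 2.0 = 1.7, \qquad V(S_t) \leftarrow 2.0 + 0.1 \cdot 1.7 = 2.17,$$

so the estimate of $V(S_t)$ is nudged upward, because the observed one-step outcome was better than the previous belief.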
TD(0) vs Monte Carlo Estimation
Both TD(0) and Monte Carlo estimation use sampled experience to estimate the state value function $v_\pi(s)$ for a policy $\pi$. Under standard convergence conditions, both converge to the true $v_\pi(s)$ as the number of visits to each state goes to infinity. In practice, however, we only ever have a finite amount of data, and the two methods differ significantly in how they use that data and how quickly they learn.
Bias-Variance Tradeoff
From a bias-variance tradeoff perspective:

Monte Carlo estimation waits until an episode ends and then uses the full return to update values. This yields unbiased estimates (the returns truly reflect the underlying distribution), but they can swing dramatically, especially in long or highly stochastic tasks. High variance means many episodes are required to average out the noise and obtain stable value estimates.

TD(0) bootstraps by combining each one-step reward with the current estimate of the next state's value. This introduces bias (early updates rely on imperfect estimates) but keeps variance low, since each update is based on a small, incremental error. Lower variance lets TD(0) propagate reward information through the state space more quickly, even though initial bias can slow down convergence.
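The difference in update targets can be made explicit in code. The sketch below contrasts the two targets for the first state of a single recorded episode; the episode data, discount factor, and value estimates are made up for illustration.

```python
import numpy as np

gamma = 0.9
# A made-up episode: rewards observed after each of the first three states.
rewards = [1.0, 0.0, 2.0]             # R_1, R_2, R_3 (episode ends after R_3)
V = np.zeros(4)                        # current value estimates for states 0..3

# Monte Carlo target for state 0: the full observed return G_0 (high variance, no bias).
mc_target = sum(gamma**k * r for k, r in enumerate(rewards))

# TD(0) target for state 0: one real reward plus a bootstrapped estimate (low variance, biased).
td_target = rewards[0] + gamma * V[1]
```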
Learning Data vs Learning Model
Another way to look at these two methods is to analyze what each of them really learns:
Monte Carlo estimation learns directly from the observed returns, effectively fitting its value estimates to the specific episodes it has seen. This means it minimizes error on those training trajectories, but because it never builds an explicit view of how states lead to one another, it can struggle to generalize to new or slightly different situations.
TD(0), by contrast, bootstraps on each one-step transition, combining the immediate reward with its estimate of the next state's value. In doing so, it effectively captures the relationships between states, forming an implicit model of the environment's dynamics. This model-like understanding lets TD(0) generalize better to unseen transitions, often yielding more accurate value estimates on new data.
Pseudocode
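The following Python sketch shows one common way to write the full TD(0) prediction loop. It assumes a tabular setting, a Gym-style environment interface (`env.reset()`, `env.step()`), and a given policy function; these interface names and hyperparameters are illustrative assumptions rather than a definitive implementation.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the state value function v_pi with TD(0)."""
    V = defaultdict(float)  # value estimates, initialized to 0 for every state

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)  # sample A_t from pi(. | S_t)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD(0) update: move V(S_t) toward R_{t+1} + gamma * V(S_{t+1}).
            # At terminal states the bootstrap term is zeroed out.
            target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (target - V[state])
            state = next_state

    return V
```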