Bellman Equations
A Bellman equation is a functional equation that defines a value function in a recursive form.
To clarify the definition:
- A functional equation is an equation whose solution is a function. For a Bellman equation, this solution is the value function for which the equation was formulated;
- A recursive form means that the value at the current state is expressed in terms of values at future states.
In short, solving the Bellman equation gives the desired value function, and deriving this equation requires identifying a recursive relationship between current and future states.
State Value Function
As a reminder, here is a state value function in compact form:
$$
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
$$

To obtain the Bellman equation for this value function, let's expand the right side of the equation and establish a recursive relationship:
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\
&= \mathbb{E}_\pi\Bigl[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \Bigm| S_t = s\Bigr] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl(r + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\bigr) \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl(r + \gamma v_\pi(s')\bigr)
\end{aligned}
$$

The last equation in this chain is the Bellman equation for the state value function.
Intuition
To find the value of a state s, you:
- Consider all possible actions $a$ you might take from this state, each weighted by how likely you are to choose that action under your current policy $\pi(a \mid s)$;
- For each action $a$, you consider all possible next states $s'$ and rewards $r$, weighted by their likelihood $p(s', r \mid s, a)$;
- For each of these outcomes, you take the immediate reward $r$ you get plus the discounted value of the next state, $\gamma v_\pi(s')$.
By summing all these possibilities together, you get the total expected value of the state s under your current policy.
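The steps above can be sketched as iterative policy evaluation: the Bellman equation is turned into an update rule and applied until the values stop changing. The tiny MDP, policy, and rewards below are invented purely for illustration:

```python
# A made-up MDP with 2 states and 2 actions, used only to illustrate
# the Bellman equation for v_pi; all numbers are arbitrary assumptions.
# p[s][a] is a list of (next_state, reward, probability) triples,
# i.e. the dynamics p(s', r | s, a).
p = {
    0: {0: [(0, 0.0, 0.5), (1, 1.0, 0.5)], 1: [(1, 2.0, 1.0)]},
    1: {0: [(0, 0.0, 1.0)], 1: [(1, 1.0, 1.0)]},
}
# pi[s][a] is the probability of choosing action a in state s.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
gamma = 0.9  # discount factor

# Repeatedly replace v(s) with the right side of the Bellman equation;
# the fixed point of this update is v_pi.
v = {s: 0.0 for s in p}
for _ in range(500):
    v = {
        s: sum(
            pi[s][a]
            * sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s][a])
            for a in p[s]
        )
        for s in p
    }

print({s: round(val, 3) for s, val in v.items()})
```

Because the update is a contraction (for $\gamma < 1$), the iteration converges to the unique solution of the Bellman equation regardless of the initial values.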
Action Value Function
Here is an action value function in compact form:
$$
q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
$$

The derivation of the Bellman equation for this function is quite similar to the previous one:
$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi\Bigl[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \Bigm| S_t = s, A_t = a\Bigr] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\bigl(r + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\bigr) \\
&= \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \sum_{a'} \pi(a' \mid s')\,\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s', A_{t+1} = a']\Bigr) \\
&= \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \sum_{a'} \pi(a' \mid s')\,q_\pi(s', a')\Bigr)
\end{aligned}
$$

The last equation in this chain is the Bellman equation for the action value function.
Intuition
To find the value of a state-action pair (s,a), you:
- Consider all possible next states $s'$ and rewards $r$, weighted by their likelihood $p(s', r \mid s, a)$;
- For each of these outcomes, you take the immediate reward r you get plus the discounted value of the next state;
- To compute the value of the next state $s'$, multiply each action value $q_\pi(s', a')$ by the probability of choosing $a'$ in state $s'$ under the current policy, $\pi(a' \mid s')$, and sum over all actions $a'$ available in $s'$.
By summing all these possibilities together, you get the total expected value of the state-action pair (s,a) under your current policy.
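The same fixed-point iteration works for action values. As before, the tiny MDP and policy below are made-up assumptions used only to illustrate the Bellman equation for $q_\pi$:

```python
# A made-up 2-state, 2-action MDP; all numbers are arbitrary assumptions.
# p[s][a] is a list of (next_state, reward, probability) triples.
p = {
    0: {0: [(0, 0.0, 0.5), (1, 1.0, 0.5)], 1: [(1, 2.0, 1.0)]},
    1: {0: [(0, 0.0, 1.0)], 1: [(1, 1.0, 1.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
gamma = 0.9

# q[(s, a)] starts at zero and is repeatedly replaced by the right
# side of the Bellman equation for action values.
q = {(s, a): 0.0 for s in p for a in p[s]}
for _ in range(500):
    q = {
        (s, a): sum(
            prob
            * (r + gamma * sum(pi[s2][a2] * q[(s2, a2)] for a2 in p[s2]))
            for s2, r, prob in p[s][a]
        )
        for s in p
        for a in p[s]
    }

# v_pi can be recovered by averaging q_pi over the policy's
# action probabilities: v(s) = sum_a pi(a|s) * q(s, a).
v = {s: sum(pi[s][a] * q[(s, a)] for a in p[s]) for s in p}
print({k: round(val, 3) for k, val in q.items()})
```

The final consistency check, $v_\pi(s) = \sum_a \pi(a \mid s)\,q_\pi(s, a)$, ties the two value functions together.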