Introduction to Reinforcement Learning with Python

Bellman Equations


Note
Definition

A Bellman equation is a functional equation that defines a value function in a recursive form.

To clarify the definition:

  • A functional equation is an equation whose solution is a function. For the Bellman equation, this solution is the value function for which the equation was formulated;
  • A recursive form means that the value of the current state is expressed in terms of the values of future states.

In short, solving the Bellman equation gives the desired value function, and deriving this equation requires identifying a recursive relationship between current and future states.

State Value Function

As a reminder, here is a state value function in compact form:

vฯ€(s)=Eโกฯ€[GtโˆฃSt=s]\def\E{\operatorname{\mathbb{E}}} v_\pi(s) = \E_\pi[G_t | S_t = s]

To obtain the Bellman equation for this value function, let's expand the right side of the equation and establish a recursive relationship:

vฯ€(s)=Eโกฯ€[GtโˆฃSt=s]=Eโกฯ€[Rt+1+ฮณRt+2+ฮณ2Rt+3+...โˆฃSt=s]=Eโกฯ€[Rt+1+ฮณโˆ‘k=0โˆžฮณkRt+k+2โˆฃSt=s]=Eโกฯ€[Rt+1+ฮณGt+1โˆฃSt=s]=โˆ‘aฯ€(aโˆฃs)โˆ‘sโ€ฒ,rp(sโ€ฒ,rโˆฃs,a)(r+ฮณEโกฯ€[Gt+1โˆฃSt+1=sโ€ฒ])=โˆ‘aฯ€(aโˆฃs)โˆ‘sโ€ฒ,rp(sโ€ฒ,rโˆฃs,a)(r+ฮณvฯ€(sโ€ฒ))\def\E{\operatorname{\mathbb{E}}} \begin{aligned} v_\pi(s) &= \E_\pi[G_t | S_t = s]\\ &= \E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]\\ &= \E_\pi[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s]\\ &= \E_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s]\\ &= \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \E_\pi\Bigl[G_{t+1} | S_{t+1} = s'\Bigr]\Bigr)\\ &= \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_\pi(s')\Bigr) \end{aligned}

The last equation in this chain is the Bellman equation for the state value function.

Intuition

To find the value of a state s, you:

  1. Consider all possible actions a you might take from this state, each weighted by the probability of choosing that action under your current policy π(a | s);
  2. For each action a, consider all possible next states s' and rewards r, weighted by their likelihood p(s', r | s, a);
  3. For each of these outcomes, take the immediate reward r plus the discounted value of the next state γv_π(s').

By summing all these possibilities together, you get the total expected value of the state s under your current policy.
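The steps above translate directly into code. Below is a minimal sketch of the Bellman backup for the state value function on a hypothetical two-state MDP (the states, actions, rewards, and transition table are illustrative, not from the lesson). Applying the right-hand side of the Bellman equation repeatedly, starting from zeros, converges to v_π — this is iterative policy evaluation:

```python
gamma = 0.9

# Hypothetical dynamics: p[s][a] -> list of (next_state, reward, probability)
p = {
    0: {"stay": [(0, 0.0, 1.0)], "move": [(1, 1.0, 1.0)]},
    1: {"stay": [(1, 2.0, 1.0)], "move": [(0, 0.0, 1.0)]},
}

# A uniform random policy: pi[s][a] is the probability of action a in state s
pi = {s: {"stay": 0.5, "move": 0.5} for s in p}

def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for state s:
    sum over actions of pi(a|s) * sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(
        pi[s][a] * sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s][a])
        for a in p[s]
    )

# Iterative policy evaluation: apply the backup until the values stop changing
v = {s: 0.0 for s in p}
for _ in range(1000):
    v = {s: bellman_backup(v, s) for s in p}
```

For this toy MDP the fixed point can be checked by hand: solving the two linear Bellman equations gives v(0) = 7.25 and v(1) = 7.75, which is what the iteration converges to.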

Action Value Function

Here is an action value function in compact form:

qฯ€(s,a)=Eโกฯ€[GtโˆฃSt=s,At=a]\def\E{\operatorname{\mathbb{E}}} q_\pi(s, a) = \E_\pi[G_t | S_t = s, A_t = a]

The derivation of the Bellman equation for this function is quite similar to the previous one:

qฯ€(s,a)=Eโกฯ€[GtโˆฃSt=s,At=a]=Eโกฯ€[Rt+1+ฮณRt+2+ฮณ2Rt+3+...โˆฃSt=s,At=a]=Eโกฯ€[Rt+1+ฮณโˆ‘k=0โˆžฮณkRt+k+2โˆฃSt=s,At=a]=Eโกฯ€[Rt+1+ฮณGt+1โˆฃSt=s,At=a]=โˆ‘sโ€ฒ,rp(sโ€ฒ,rโˆฃs,a)(r+ฮณEโกฯ€[Gt+1โˆฃSt+1=sโ€ฒ])=โˆ‘sโ€ฒ,rp(sโ€ฒ,rโˆฃs,a)(r+ฮณโˆ‘aโ€ฒฯ€(aโ€ฒโˆฃsโ€ฒ)(Eโกฯ€[Gt+1โˆฃSt+1=sโ€ฒ,At+1=aโ€ฒ]))=โˆ‘sโ€ฒ,rp(sโ€ฒ,rโˆฃs,a)(r+ฮณโˆ‘aโ€ฒฯ€(aโ€ฒโˆฃsโ€ฒ)q(sโ€ฒ,aโ€ฒ))\def\E{\operatorname{\mathbb{E}}} \begin{aligned} q_\pi(s, a) &= \E_\pi[G_t | S_t = s, A_t = a]\\ &= \E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a]\\ &= \E_\pi[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s, A_t = a]\\ &= \E_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a]\\ &= \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \E_\pi\Bigl[G_{t+1} | S_{t+1} = s'\Bigr]\Bigr)\\ &= \sum_{s', r} p(s', r | s, a)\Biggl(r + \gamma \sum_{a'} \pi(a' | s') \Bigl(\E_\pi\Bigl[G_{t+1} | S_{t+1} = s', A_{t+1} = a'\Bigr]\Bigr)\Biggr)\\ &= \sum_{s', r} p(s', r | s, a)\Biggl(r + \gamma \sum_{a'} \pi(a' | s') q(s', a')\Biggr) \end{aligned}

The last equation in this chain is the Bellman equation for the action value function.

Intuition

To find the value of a state-action pair (s, a), you:

  1. Consider all possible next states s' and rewards r, weighted by their likelihood p(s', r | s, a);
  2. For each of these outcomes, take the immediate reward r plus the discounted value of the next state;
  3. To compute the value of the next state s', multiply each action value q_π(s', a') by the probability of choosing a' in state s' under the current policy π(a' | s'), and sum over all actions a' possible from s'.

By summing all these possibilities together, you get the total expected value of the state-action pair (s, a) under your current policy.
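The same iterative idea applies to the action value function. Below is a minimal sketch of the Bellman backup for q_π on the same kind of hypothetical two-state MDP (all names and numbers are illustrative). The only structural difference from the state-value backup is the order of the sums: the expectation over (s', r) comes first, and the policy-weighted sum over next actions a' sits inside it:

```python
gamma = 0.9

# Hypothetical dynamics: p[s][a] -> list of (next_state, reward, probability)
p = {
    0: {"stay": [(0, 0.0, 1.0)], "move": [(1, 1.0, 1.0)]},
    1: {"stay": [(1, 2.0, 1.0)], "move": [(0, 0.0, 1.0)]},
}

# A uniform random policy
pi = {s: {"stay": 0.5, "move": 0.5} for s in p}

def bellman_backup_q(q, s, a):
    """Right-hand side of the Bellman equation for the pair (s, a):
    sum over (s', r) of p(s', r | s, a) * (r + gamma * sum over a' of pi(a'|s') * q(s', a'))."""
    return sum(
        prob * (r + gamma * sum(pi[s2][a2] * q[(s2, a2)] for a2 in p[s2]))
        for s2, r, prob in p[s][a]
    )

# Iterative policy evaluation for q, starting from zeros
q = {(s, a): 0.0 for s in p for a in p[s]}
for _ in range(1000):
    q = {(s, a): bellman_backup_q(q, s, a) for s in p for a in p[s]}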


Section 3. Chapter 2
