Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update rule looks similar to the one used in MAB algorithms:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(G - Q(s,a)\bigr)$$
where $\alpha = \dfrac{1}{N(s,a)}$ for a mean estimate. The only values that have to be stored are the current action value estimates $Q(s,a)$ and the number of times the state-action pair $(s,a)$ has been visited, $N(s,a)$.
Pseudocode
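Below is a minimal Python sketch of this incremental on-policy (every-visit) MC control, not the course's pseudocode. It assumes a hypothetical environment object exposing `reset()`, `actions(state)`, and `step(action)` returning `(next_state, reward, done)`; the policy is ε-greedy with respect to the current estimates.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Every-visit on-policy MC control with incremental Q updates."""
    Q = defaultdict(float)  # current action value estimates Q(s, a)
    N = defaultdict(int)    # visit counts N(s, a)

    def policy(state, actions):
        # epsilon-greedy with respect to the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate an episode as a list of (state, action, reward) triples
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state, env.actions(state))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[(state, action)] += 1
            alpha = 1.0 / N[(state, action)]
            # Incremental mean update: no list of returns is ever stored
            Q[(state, action)] += alpha * (G - Q[(state, action)])

    return Q
```

Only $Q(s,a)$ and $N(s,a)$ are stored, so memory per state-action pair stays constant no matter how many episodes are generated.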
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything works the same way as in the on-policy method; the only difference is that the returns being averaged are the importance-weighted ones.
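As a brief sketch of that claim, if $G$ is the raw return of a trajectory and $\rho$ its importance sampling ratio, the same incremental rule is simply applied to the weighted return:

$$Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}\bigl(\rho G - Q(s,a)\bigr)$$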
A more interesting situation arises with weighted importance sampling. The update equation looks the same:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(G - Q(s,a)\bigr)$$
but $\alpha = \dfrac{1}{N(s,a)}$ can't be used here, because:
- Each return is weighted by $\rho$;
- The final sum is divided not by $N(s,a)$, but by $\sum \rho(s,a)$.
The value of $\alpha$ that can actually be used in this case is $\dfrac{W}{C(s,a)}$, where:

- $W$ is the $\rho$ of the current trajectory;
- $C(s,a)$ is equal to $\sum \rho(s,a)$.

Each time the state-action pair $(s,a)$ occurs, the $\rho$ of the current trajectory is added to $C(s,a)$:
$$C(s,a) \leftarrow C(s,a) + W$$
Pseudocode
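Below is a matching Python sketch for the weighted importance sampling case, again not the course's pseudocode. It assumes the same hypothetical environment interface; for simplicity the behavior policy is uniformly random, so $b(a|s) = 1/|A(s)|$, while the target policy is greedy with respect to $Q$.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=0.99):
    """Every-visit off-policy MC control with weighted importance sampling."""
    Q = defaultdict(float)  # action value estimates Q(s, a) for the target policy
    C = defaultdict(float)  # cumulative weights C(s, a): the running sum of rho

    for _ in range(num_episodes):
        # Generate an episode with the uniformly random behavior policy
        episode = []
        state = env.reset()
        done = False
        while not done:
            actions = env.actions(state)
            action = random.choice(actions)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, actions))
            state = next_state

        # Walk the episode backwards, maintaining the return G and the weight W
        G, W = 0.0, 1.0
        for state, action, reward, actions in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W                 # C(s,a) <- C(s,a) + W
            alpha = W / C[(state, action)]          # alpha = W / C(s,a)
            Q[(state, action)] += alpha * (G - Q[(state, action)])
            greedy_action = max(actions, key=lambda a: Q[(state, a)])
            if action != greedy_action:
                break          # target policy never takes this action: weight becomes 0
            W *= len(actions)  # pi(a|s) = 1 and b(a|s) = 1/|A(s)|, so W *= 1 / b(a|s)

    return Q
```

Only $Q(s,a)$ and $C(s,a)$ are stored. The backward loop stops as soon as the behavior action disagrees with the greedy target action, because from that step on the importance weight of the trajectory would be zero.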