Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update rule looks similar to the one used in MAB algorithms:
$$Q(s,a) \gets Q(s,a) + \alpha \bigl(G - Q(s,a)\bigr)$$

where $\alpha = \frac{1}{N(s,a)}$ for a mean estimate. The only values that have to be stored are the current action value estimates $Q(s,a)$ and the number of times each state-action pair has been visited, $N(s,a)$.
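As a quick sanity check, the incremental rule with $\alpha = \frac{1}{N}$ reproduces exactly the ordinary arithmetic mean of all returns seen so far, without storing them. A minimal sketch with made-up return values:

```python
# Illustrative returns for a single state-action pair (made-up numbers)
returns = [4.0, 8.0, 5.0, 7.0]

q, n = 0.0, 0
for g in returns:
    n += 1
    q += (g - q) / n   # Q <- Q + (1/N) * (G - Q)

# q now equals the sample mean of all observed returns
print(q, sum(returns) / len(returns))
```

Both printed values agree (up to floating-point rounding), which is why only $Q(s,a)$ and the counter $N(s,a)$ need to be kept in memory.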
Pseudocode
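The steps above can be sketched in code. The tiny chain environment below, along with all names in it, is an illustrative assumption, not part of the original algorithm description; the update inside the backward loop is the incremental rule with $\alpha = \frac{1}{N(s,a)}$, combined with an $\varepsilon$-greedy policy for exploration:

```python
import random
from collections import defaultdict

# Toy episodic MDP (an assumption for illustration): chain of states 0..3;
# action 1 moves right, action 0 moves left (floored at 0); reaching
# state 3 ends the episode with reward +1, every other step costs -0.1.
ACTIONS, TERMINAL = (0, 1), 3

def step(state, action):
    nxt = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == TERMINAL else -0.1), nxt == TERMINAL

def epsilon_greedy(Q, state, epsilon):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def on_policy_mc_control(episodes=2000, gamma=0.9, epsilon=0.1, seed=0):
    random.seed(seed)
    Q = defaultdict(float)  # current action value estimates Q(s, a)
    N = defaultdict(int)    # visit counters N(s, a)
    for _ in range(episodes):
        # Generate one episode under the current eps-greedy policy
        trajectory, state, done = [], 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon)
            state_next, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = state_next
        # Process the episode backwards, accumulating the return G
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = gamma * G + r
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # alpha = 1/N(s,a)
    return Q

Q = on_policy_mc_control()
# Greedy policy recovered from Q: action chosen in each non-terminal state
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)])
```

On this chain, moving right is optimal in every state, so the learned greedy policy selects action 1 everywhere.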
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything is the same as in the on-policy case.
A more interesting situation arises with weighted importance sampling. The update equation looks the same:

$$Q(s,a) \gets Q(s,a) + \alpha \bigl(G - Q(s,a)\bigr)$$

but $\alpha = \frac{1}{N(s,a)}$ can't be used because:
- Each return is weighted by the importance sampling ratio $\rho$;
- The final sum is divided not by $N(s,a)$, but by $\sum \rho(s,a)$.
The value of $\alpha$ that can actually be used in this case is $\frac{W}{C(s,a)}$, where:

- $W$ is the $\rho$ of the current trajectory;
- $C(s,a)$ is equal to $\sum \rho(s,a)$.
And each time the state-action pair $(s,a)$ occurs, the $\rho$ of the current trajectory is added to $C(s,a)$:

$$C(s,a) \gets C(s,a) + W$$

Pseudocode
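A sketch of the weighted importance sampling version, again on a toy chain environment that is an illustrative assumption rather than part of the original text. The behavior policy is uniformly random, so each of its action probabilities is 0.5; the target policy is greedy with respect to $Q$, and the backward loop applies $\alpha = \frac{W}{C(s,a)}$ with $C(s,a) \gets C(s,a) + W$:

```python
import random
from collections import defaultdict

# Toy episodic MDP (an assumption for illustration): chain of states 0..3;
# action 1 moves right, action 0 moves left (floored at 0); reaching
# state 3 ends the episode with reward +1, every other step costs -0.1.
ACTIONS, TERMINAL = (0, 1), 3

def step(state, action):
    nxt = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == TERMINAL else -0.1), nxt == TERMINAL

def off_policy_mc_control(episodes=5000, gamma=0.9, seed=0):
    random.seed(seed)
    Q = defaultdict(float)  # action value estimates Q(s, a)
    C = defaultdict(float)  # cumulative weights C(s, a) = sum of rho
    greedy = {}             # greedy (target) policy
    for _ in range(episodes):
        # Behavior policy b: uniformly random, so b(a|s) = 0.5
        trajectory, state, done = [], 0, False
        while not done:
            action = random.choice(ACTIONS)
            state_next, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = state_next
        # Walk the episode backwards with return G and weight W
        G, W = 0.0, 1.0
        for s, a, r in reversed(trajectory):
            G = gamma * G + r
            C[(s, a)] += W                                # C(s,a) <- C(s,a) + W
            Q[(s, a)] += W / C[(s, a)] * (G - Q[(s, a)])  # alpha = W / C(s,a)
            greedy[s] = max(ACTIONS, key=lambda x: Q[(s, x)])
            if a != greedy[s]:
                break       # target prob is 0, so rho = 0 for earlier steps
            W /= 0.5        # target prob of the greedy action is 1; divide by b(a|s)
    return Q, greedy

Q, greedy = off_policy_mc_control()
print([greedy.get(s) for s in range(TERMINAL)])
```

Note the early `break`: once an action disagrees with the greedy target policy, $\rho = 0$ for every earlier step of the trajectory, so no further updates are needed for that episode.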