Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update rule looks similar to the one used in MAB algorithms:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(G - Q(s,a)\bigr)$$
where $\alpha = \dfrac{1}{N(s,a)}$ for a mean estimate. The only values that have to be stored are the current action value estimates $Q(s,a)$ and the number of times the state-action pair $(s,a)$ has been visited, $N(s,a)$.
Pseudocode
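Below is a minimal Python sketch of this incremental on-policy (every-visit) MC control, not the course's pseudocode. It assumes a hypothetical environment object exposing `reset()`, `actions(state)`, and `step(action)` returning `(next_state, reward, done)`; the policy is ε-greedy with respect to the current estimates.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Every-visit on-policy MC control with incremental Q updates."""
    Q = defaultdict(float)  # current action value estimates Q(s, a)
    N = defaultdict(int)    # visit counts N(s, a)

    def policy(state, actions):
        # epsilon-greedy with respect to the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate an episode as a list of (state, action, reward) triples
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state, env.actions(state))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[(state, action)] += 1
            alpha = 1.0 / N[(state, action)]
            # Incremental mean update: no list of returns is ever stored
            Q[(state, action)] += alpha * (G - Q[(state, action)])

    return Q
```

Only $Q(s,a)$ and $N(s,a)$ are stored, so memory per state-action pair stays constant no matter how many episodes are generated.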
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything works the same way as in the on-policy method; the only difference is that the returns being averaged are the importance-weighted ones.
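As a brief sketch of that claim, if $G$ is the raw return of a trajectory and $\rho$ its importance sampling ratio, the same incremental rule is simply applied to the weighted return:

$$Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}\bigl(\rho G - Q(s,a)\bigr)$$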
A more interesting situation arises with weighted importance sampling. The update equation looks the same:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl(G - Q(s,a)\bigr)$$
but $\alpha = \dfrac{1}{N(s,a)}$ can't be used here, because:
- Each return is weighted by $\rho$;
- The final sum is divided not by $N(s,a)$, but by $\sum \rho(s,a)$.
The value of $\alpha$ that can actually be used in this case is $\dfrac{W}{C(s,a)}$, where:

- $W$ is the $\rho$ of the current trajectory;
- $C(s,a)$ is equal to $\sum \rho(s,a)$.

Each time the state-action pair $(s,a)$ occurs, the $\rho$ of the current trajectory is added to $C(s,a)$:
$$C(s,a) \leftarrow C(s,a) + W$$
Pseudocode
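Below is a matching Python sketch for the weighted importance sampling case, again not the course's pseudocode. It assumes the same hypothetical environment interface; for simplicity the behavior policy is uniformly random, so $b(a|s) = 1/|A(s)|$, while the target policy is greedy with respect to $Q$.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=0.99):
    """Every-visit off-policy MC control with weighted importance sampling."""
    Q = defaultdict(float)  # action value estimates Q(s, a) for the target policy
    C = defaultdict(float)  # cumulative weights C(s, a): the running sum of rho

    for _ in range(num_episodes):
        # Generate an episode with the uniformly random behavior policy
        episode = []
        state = env.reset()
        done = False
        while not done:
            actions = env.actions(state)
            action = random.choice(actions)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, actions))
            state = next_state

        # Walk the episode backwards, maintaining the return G and the weight W
        G, W = 0.0, 1.0
        for state, action, reward, actions in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W                 # C(s,a) <- C(s,a) + W
            alpha = W / C[(state, action)]          # alpha = W / C(s,a)
            Q[(state, action)] += alpha * (G - Q[(state, action)])
            greedy_action = max(actions, key=lambda a: Q[(state, a)])
            if action != greedy_action:
                break          # target policy never takes this action: weight becomes 0
            W *= len(actions)  # pi(a|s) = 1 and b(a|s) = 1/|A(s)|, so W *= 1 / b(a|s)

    return Q
```

Only $Q(s,a)$ and $C(s,a)$ are stored. The backward loop stops as soon as the behavior action disagrees with the greedy target action, because from that step on the importance weight of the trajectory would be zero.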