Policy Improvement
Policy improvement is the process of improving a policy based on current value function estimates.
Like policy evaluation, policy improvement can work with either the state value function or the action value function. For DP methods, however, the state value function is used.
Now that you can estimate the state value function for any policy, a natural next step is to explore whether there are any policies better than the current one. One way of doing this is to consider taking a different action $a$ in a state $s$, and following the current policy afterwards. If this sounds familiar, it's because it resembles how we define the action value function:
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)$$

If this new value is greater than the original state value $v_\pi(s)$, it indicates that taking action $a$ in state $s$ and then continuing with policy $\pi$ leads to better outcomes than strictly following policy $\pi$. Since states are independent, it's optimal to always select action $a$ whenever state $s$ is encountered. Therefore, we can construct an improved policy $\pi'$, identical to $\pi$ except that it selects action $a$ in state $s$, which would be superior to the original policy $\pi$.
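To make this concrete, here is a minimal Python sketch of that one-step lookahead. It assumes a hypothetical model layout in which `p[s][a]` holds `(prob, next_state, reward)` tuples and `v` maps states to their current value estimates under $\pi$; the function name `action_value` and the discount `gamma` are illustrative, not part of any particular library.

```python
def action_value(p, v, s, a, gamma=0.9):
    """One-step lookahead: estimate q_pi(s, a) from the model and v_pi.

    Assumes p[s][a] is an iterable of (prob, next_state, reward) tuples
    and v maps each state to its current value estimate under pi.
    """
    return sum(prob * (reward + gamma * v[s_next])
               for prob, s_next, reward in p[s][a])
```

Comparing `action_value(p, v, s, a)` with `v[s]` then indicates whether switching to action `a` in state `s` would improve on the current policy.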
Policy Improvement Theorem
The reasoning described above can be generalized as the policy improvement theorem:
$$q_\pi(s, \pi'(s)) \ge v_\pi(s) \;\; \forall s \in S \quad\Longrightarrow\quad v_{\pi'}(s) \ge v_\pi(s) \;\; \forall s \in S$$

The proof of this theorem is relatively simple and can be obtained by repeated substitution:
$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\bigr] \\
&= v_{\pi'}(s)
\end{aligned}$$

Improvement Strategy
While updating actions for certain states can lead to improvements, it's more effective to update actions for all states simultaneously. Specifically, for each state s, select the action a that maximizes the action value qΟβ(s,a):
$$\pi'(s) \doteq \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)$$

where $\arg\max$ (short for argument of the maximum) is an operator that returns the value of the variable that maximizes a given function.
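As a sketch of this greedy step (under the same assumed model layout as the lookahead snippet above, with `p[s][a]` holding `(prob, next_state, reward)` tuples), the loop below computes $q_\pi(s, a)$ for every action in every state and keeps the argmax:

```python
def improve_policy(p, v, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a) for each state.

    Assumes p[s][a] is an iterable of (prob, next_state, reward) tuples
    and v maps each state to its value estimate under the current policy.
    """
    new_policy = {}
    for s, actions in p.items():
        # One-step lookahead for every available action in state s
        q_values = {
            a: sum(prob * (reward + gamma * v[s_next])
                   for prob, s_next, reward in transitions)
            for a, transitions in actions.items()
        }
        # argmax over actions: the action with the highest action value
        new_policy[s] = max(q_values, key=q_values.get)
    return new_policy
```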
The resulting greedy policy, denoted by Οβ², satisfies the conditions of the policy improvement theorem by construction, guaranteeing that Οβ² is at least as good as the original policy Ο, and typically better.
If $\pi'$ is as good as, but not better than, $\pi$, then both $\pi'$ and $\pi$ are optimal policies, as their value functions are equal and satisfy the Bellman optimality equation:
$$v_\pi(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)$$
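In code, this case can be detected by checking whether greedy improvement leaves the policy unchanged. The sketch below reuses the hypothetical `improve_policy` helper from the previous snippet; if no action changes, $v_\pi$ already satisfies the Bellman optimality equation and the policy is optimal (up to ties between equally good actions).

```python
def policy_is_stable(p, v, policy, gamma=0.9):
    """Return True if greedy improvement does not change the policy,
    i.e. v_pi already satisfies the Bellman optimality equation."""
    greedy = improve_policy(p, v, gamma)  # hypothetical helper defined above
    return all(greedy[s] == policy[s] for s in policy)
```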