Policy Evaluation
Policy evaluation is the process of determining the value function of a given policy.
Policy evaluation can be used to estimate both the state value function and the action value function. For DP methods, however, the state value function is used.
As you know, the state value function of a given policy can be determined by solving the Bellman equation:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl( r + \gamma v_\pi(s') \bigr)$$

If you have a complete model of the environment (i.e., known transition probabilities and expected rewards for all state-action pairs), the only unknown variables remaining in the equation are the state values. Therefore, the equation above can be reformulated as a system of $|S|$ linear equations with $|S|$ unknowns.
For example, if an MDP has 2 states ($s_1$, $s_2$) and 2 actions (move to $s_1$, move to $s_2$), the state value function could be defined like this:
$$\begin{cases} V(s_1) = 0.5 \cdot \bigl( 5 + 0.9 \cdot V(s_1) \bigr) + 0.5 \cdot \bigl( 10 + 0.9 \cdot V(s_2) \bigr) \\ V(s_2) = 0.7 \cdot \bigl( 2 + 0.9 \cdot V(s_1) \bigr) + 0.3 \cdot \bigl( 0 + 0.9 \cdot V(s_2) \bigr) \end{cases}$$

This can be solved using standard linear algebra techniques.
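As a minimal sketch, the system above can be solved with NumPy by rearranging it into the standard $A\mathbf{v} = \mathbf{b}$ form; the coefficients below come directly from the example equations:

```python
import numpy as np

# Move every V(s) term to the left-hand side:
#   0.55 * V(s1) - 0.45 * V(s2) = 7.5
#  -0.63 * V(s1) + 0.73 * V(s2) = 1.4
A = np.array([
    [1 - 0.5 * 0.9, -0.5 * 0.9],
    [-0.7 * 0.9, 1 - 0.3 * 0.9],
])
b = np.array([
    0.5 * 5 + 0.5 * 10,  # expected immediate reward in s1
    0.7 * 2 + 0.3 * 0,   # expected immediate reward in s2
])

v = np.linalg.solve(A, b)
print(v)  # roughly [51.7, 46.6]
```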
A unique solution to such a linear system is guaranteed if at least one of the following conditions holds:
- The discount factor satisfies $\gamma < 1$;
- The policy $\pi$, when followed from any state $s$, ensures that the episode eventually terminates.
Iterative Policy Evaluation
The solution can be computed directly, but an iterative approach is more commonly used due to its ease of implementation. This method begins by assigning arbitrary initial values to all states, except for terminal states, which are set to 0. The values are then updated iteratively using the Bellman equation as the update rule:
$$v_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl( r + \gamma v_k(s') \bigr)$$

The estimated state value function $v_k$ eventually converges to the true state value function $v_\pi$ as $k \to \infty$ if $v_\pi$ exists.
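As an illustration, one sweep of this update can be sketched in Python as follows. The `states`, `policy`, and `model` structures are assumed placeholders rather than part of any particular library: `policy[s][a]` gives $\pi(a \mid s)$ and `model[s][a]` is a list of `(probability, next_state, reward)` triples.

```python
def evaluation_sweep(states, policy, model, v_old, gamma=0.9):
    """One sweep: compute v_{k+1}(s) from v_k for every state."""
    v_new = {}
    for s in states:
        total = 0.0
        for a, action_prob in policy[s].items():
            for prob, next_s, reward in model[s][a]:
                # pi(a|s) * p(s', r | s, a) * (r + gamma * v_k(s'))
                total += action_prob * prob * (reward + gamma * v_old[next_s])
        v_new[s] = total
    return v_new
```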
Value Backup Strategies
When updating value estimates, new estimates are computed based on previous values. The process of preserving previous estimates is known as a backup. There are two common strategies for performing backups:
- Full backup: this method involves storing the new estimates in a separate array, distinct from the one containing the previous (backed-up) values. Consequently, two arrays are required: one for maintaining the previous estimates and another for storing the newly computed values;
- In-place backup: this approach maintains all values within a single array. Each new estimate immediately replaces the previous value. This method reduces memory usage, as only one array is needed.
Typically, the in-place backup method is preferred because it requires less memory and converges more rapidly, due to the immediate use of the latest estimates.
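For comparison, an in-place version of the sweep sketched earlier (under the same assumed data structures) reads from and writes to a single dictionary, so later updates within the same sweep already see the newest estimates:

```python
def evaluation_sweep_in_place(states, policy, model, v, gamma=0.9):
    """In-place sweep: overwrite v[s] immediately, using a single array."""
    for s in states:
        total = 0.0
        for a, action_prob in policy[s].items():
            for prob, next_s, reward in model[s][a]:
                # States updated earlier in this sweep already contribute
                # their newest values here.
                total += action_prob * prob * (reward + gamma * v[next_s])
        v[s] = total
    return v
```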
When to stop updating?
In iterative policy evaluation, there is no exact point at which the algorithm should stop. While convergence is guaranteed in the limit, continuing computations beyond a certain point is unnecessary in practice. A simple and effective stopping criterion is to track the absolute difference between consecutive value estimates, $|v_{k+1}(s) - v_k(s)|$, and compare it to a small threshold $\theta$. If, after a full update cycle (where values for all states are updated), no changes exceed $\theta$, the process can be safely terminated.
Pseudocode
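As a sketch, the complete procedure can be written in Python roughly as follows, using the same placeholder `states`, `policy`, and `model` structures assumed above, with in-place backups and the $\theta$ stopping criterion:

```python
def iterative_policy_evaluation(states, policy, model, gamma=0.9, theta=1e-6):
    """Evaluate a policy by sweeping until no value changes by more than theta."""
    v = {s: 0.0 for s in states}  # zero initialization; terminal states keep value 0
    while True:
        delta = 0.0
        for s in states:
            old_value = v[s]
            new_value = 0.0
            for a, action_prob in policy[s].items():
                for prob, next_s, reward in model[s][a]:
                    new_value += action_prob * prob * (reward + gamma * v[next_s])
            v[s] = new_value  # in-place backup
            delta = max(delta, abs(new_value - old_value))
        if delta < theta:  # no state changed by more than theta in this sweep
            return v
```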