Action Values
Action value is a fundamental concept in the MAB problem. It plays a pivotal role in various algorithms, including epsilon-greedy and upper confidence bound. An action value represents the expected reward of choosing a specific action. It is similar to a state-action value, but is independent of state due to the stateless nature of the MAB problem.
Definition of Action Value
Formally, the action value, denoted as Q(a), represents the expected reward of choosing action a:
$$Q(a) = \mathbb{E}[R \mid A = a]$$

where:
- R is the reward received;
- A is the action selected.
Since the true reward distribution is typically unknown, we have to estimate Q(a) using observed data.
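To make this concrete, here is a minimal sketch of a hypothetical 3-armed Gaussian bandit in Python (using NumPy). The names `true_means` and `pull` are illustrative, and the true action value $Q(a)$ of each arm is simply the mean of its reward distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Gaussian bandit: the true action value Q(a)
# is the mean of each arm's reward distribution.
true_means = np.array([0.2, 0.5, -0.1])   # true Q(0), Q(1), Q(2)

def pull(action):
    """Sample a reward R given that action A = action was chosen."""
    return rng.normal(loc=true_means[action], scale=1.0)

# The agent never observes true_means; it only sees sampled rewards,
# so Q(a) has to be estimated from data such as:
print([round(pull(1), 2) for _ in range(3)])
```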
Estimating Action Values
There are several ways to estimate Q(a) based on observed rewards. The most common method is the sample average estimate, which calculates the mean reward received from selecting action a up to time t:
$$Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)} = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i$$

where:

- $Q_t(a)$ is the estimated value of action $a$ at time step $t$;
- $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$;
- $R_i$ is the reward obtained in each instance when action $a$ was taken.
As more samples are collected, this estimate converges to the true expected reward $Q_*(a)$, assuming the reward distribution remains stationary.
A stationary distribution is a distribution that doesn't change over time, no matter which actions are taken.
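Below is a minimal sketch of the sample average estimate, assuming the interaction history is available as plain lists of actions and rewards; `sample_average_estimates` is an illustrative helper, not part of any library:

```python
import numpy as np

def sample_average_estimates(actions, rewards, n_actions):
    """Q_t(a): mean of the rewards observed when action a was taken."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    Q = np.zeros(n_actions)
    for a in range(n_actions):
        taken = actions == a
        if taken.any():                    # only defined when N_t(a) > 0
            Q[a] = rewards[taken].mean()   # sum of rewards / N_t(a)
    return Q

# Action 1 was taken twice with rewards 1.0 and 0.0, so Q(1) = 0.5.
print(sample_average_estimates([0, 1, 1], [0.3, 1.0, 0.0], n_actions=3))
```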
Incremental Update Rule
While the formula above can be used to estimate action values, it requires storing all previous rewards and recomputing their sum at every time step. With incremental updates, this becomes unnecessary. The incremental update formula can be derived as follows:
$$\begin{aligned}
Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} R_i \\
&= \frac{1}{k} \left( R_k + \sum_{i=1}^{k-1} R_i \right) \\
&= \frac{1}{k} \bigl( R_k + (k-1) Q_k \bigr) \\
&= \frac{1}{k} \bigl( R_k + k Q_k - Q_k \bigr) \\
&= Q_k + \frac{1}{k} \left( R_k - Q_k \right)
\end{aligned}$$

where, for some action:
- $Q_k$ is the estimate of the $k$-th reward, equal to the average of the first $k-1$ rewards;
- $R_k$ is the actual $k$-th reward.
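A short sketch of the incremental rule is shown below; it assumes a fixed sequence of rewards for a single action and checks that the rule reproduces the plain sample average without storing the history:

```python
def incremental_update(Q_k, R_k, k):
    """Q_{k+1} = Q_k + (1/k) * (R_k - Q_k): a running mean with no stored history."""
    return Q_k + (R_k - Q_k) / k

rewards = [1.0, 0.0, 2.0, 1.0]
Q = 0.0
for k, R in enumerate(rewards, start=1):
    Q = incremental_update(Q, R, k)

# Both values are 1.0: the incremental rule reproduces the sample average.
print(Q, sum(rewards) / len(rewards))
```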
Intuition
Knowing the estimate of the $k$-th reward, $Q_k$, and the actual $k$-th reward, $R_k$, you can measure the error as the difference between these values. Afterward, the next estimate can be calculated by adjusting the previous estimate slightly in the direction of the actual reward, to reduce the error.
This intuition leads to another formula, which looks like this:
$$Q_{k+1} = Q_k + \alpha (R_k - Q_k)$$

where $\alpha$ is a step-size parameter controlling the rate of learning. As in the previous formula, $\alpha$ can be set to $\frac{1}{k}$, which yields the sample average estimate. Alternatively, a constant $\alpha$ is commonly used: it requires no additional memory (to store how many times each action has been taken) and allows adaptation to non-stationary environments by placing more weight on recent observations.
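Here is a minimal sketch of the constant step-size update, assuming a hypothetical non-stationary arm whose true value shifts halfway through the run; `constant_alpha_update` is an illustrative name:

```python
def constant_alpha_update(Q, R, alpha=0.1):
    """Q <- Q + alpha * (R - Q): an exponential recency-weighted average."""
    return Q + alpha * (R - Q)

# The arm's true value shifts from 1 to 5 halfway through the run.
Q = 0.0
for R in [1.0] * 50 + [5.0] * 50:
    Q = constant_alpha_update(Q, R, alpha=0.1)

print(round(Q, 3))   # close to 5: recent observations dominate the estimate
```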
Optimistic Initialization
At the beginning of a training process, action value estimates can vary significantly, which may lead to premature exploitation. This means the agent may exploit its initial knowledge too early, favoring suboptimal actions based on limited experience. To mitigate this issue and encourage initial exploration, one simple and effective technique is optimistic initialization.
In optimistic initialization, action values are initialized to relatively high values (e.g., $Q_0(a) = 1$ instead of $0$). This approach creates the impression that all actions are initially promising. As a result, the agent is incentivized to explore each action multiple times before settling on the best choice. This technique is most effective when combined with a constant step size.
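Below is a minimal sketch of optimistic initialization combined with a constant step size and purely greedy action selection, assuming the same kind of hypothetical Gaussian bandit as before:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, -0.1])   # hypothetical arm values, all below 1

Q = np.full(len(true_means), 1.0)   # optimistic start: Q_0(a) = 1 instead of 0
alpha = 0.1                         # constant step size

for t in range(500):
    a = int(np.argmax(Q))                   # purely greedy selection
    r = rng.normal(true_means[a], 1.0)      # observed reward
    Q[a] += alpha * (r - Q[a])              # constant step-size update

# Every estimate starts above any achievable reward, so each arm gets tried
# and pulled down toward realistic values before the agent settles.
print(np.round(Q, 2), "greedy choice:", int(np.argmax(Q)))
```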
The optimal action rate in this and future plots refers to the proportion of environments where the optimal action was chosen at a given time step.
For example, if there are 10 test environments, and the optimal action was selected in 6 of them at time step 200, the optimal action rate for that time step would be 0.6. This metric is useful for evaluating performance because it correlates with maximizing the reward, without depending on the exact reward values.
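As a minimal sketch, assuming hypothetical logs of the chosen actions and each environment's optimal action, the metric can be computed like this:

```python
import numpy as np

# Hypothetical logs: chosen[e, t] is the action picked in environment e at
# time step t, and optimal[e] is that environment's best arm.
chosen = np.array([[1, 2, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
optimal = np.array([1, 1, 0])

# Optimal action rate per time step: the fraction of environments whose
# chosen action matched their optimal one at that step.
rate = (chosen == optimal[:, None]).mean(axis=0)
print(rate)   # [0.333... 0.333... 1.0] for this toy data
```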