Gradient Bandits Algorithm
When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach: they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.
Preferences
Instead of maintaining action value estimates Q(a), gradient bandits maintain preference values H(a) for each action a. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:
$$P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}} = \pi_t(a)$$
where:
- H_t(a) is the preference for action a at time step t;
- P(A_t = a) is the probability of selecting action a at time step t;
- The denominator ensures that the probabilities sum to 1.
Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the argmax function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
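For instance, here is a minimal sketch of this step, assuming NumPy and a hypothetical preference vector for three actions:

import numpy as np

def softmax(x):
    """Convert a vector of preferences into a probability distribution."""
    z = x - np.max(x)              # Shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

H = np.array([2.0, 1.0, 0.0])      # Hypothetical preferences for three actions
print(softmax(H))                  # ~[0.665, 0.245, 0.090]: higher preference, higher probability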
Update Rule
After selecting an action Atβ at time t, the preference values are updated using the following rule:
$$H_{t+1}(A_t) \leftarrow H_t(A_t) + \alpha \left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$
$$H_{t+1}(a) \leftarrow H_t(a) - \alpha \left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \forall a \neq A_t$$
where:
- α is the step-size;
- R_t is the reward received at time step t;
- R̄_t is the average reward observed so far.
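As a quick worked example with hypothetical numbers: suppose α = 0.1, the current policy over three actions is π_t = (0.5, 0.3, 0.2), action 1 is selected (A_t = 1), the reward is R_t = 2, and the average reward so far is R̄_t = 1. The update then shifts the preferences as follows:
$$H_{t+1}(1) \leftarrow H_t(1) + 0.1 \cdot (2 - 1) \cdot (1 - 0.5) = H_t(1) + 0.05$$
$$H_{t+1}(2) \leftarrow H_t(2) - 0.1 \cdot (2 - 1) \cdot 0.3 = H_t(2) - 0.03$$
$$H_{t+1}(3) \leftarrow H_t(3) - 0.1 \cdot (2 - 1) \cdot 0.2 = H_t(3) - 0.02$$
Because the reward exceeded the baseline, the selected action's preference grows while the preferences of the other actions shrink.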
Intuition
On each time step, every preference is shifted slightly. The direction and size of the shift depend on how the received reward compares to the average reward:
- If the received reward is higher than average, the selected action becomes more preferred, while the other actions become less preferred;
- If the received reward is lower than average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.
Sample Code
import numpy as np

def softmax(x):
    """Simple softmax implementation"""
    x = x - np.max(x)  # Shift for numerical stability
    return np.exp(x) / np.sum(np.exp(x))

class GradientBanditsAgent:
    def __init__(self, n_actions, alpha):
        """Initialize an agent"""
        self.n_actions = n_actions    # Number of available actions
        self.alpha = alpha            # Step-size
        self.H = np.zeros(n_actions)  # Preferences
        self.reward_avg = 0           # Average reward
        self.t = 0                    # Time step counter

    def select_action(self):
        """Select an action according to the gradient bandits strategy"""
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Choose an action according to the probabilities
        return np.random.choice(self.n_actions, p=probs)

    def update(self, action, reward):
        """Update preferences"""
        # Increase the time step counter
        self.t += 1
        # Update the average reward incrementally
        self.reward_avg += (reward - self.reward_avg) / self.t
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Update preference values using stochastic gradient ascent
        self.H -= self.alpha * (reward - self.reward_avg) * probs
        self.H[action] += self.alpha * (reward - self.reward_avg)
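As a usage sketch, assuming the class above and a hypothetical 3-armed bandit with Gaussian rewards, the agent can be run like this:

true_values = np.array([0.2, 0.5, 1.0])  # Hypothetical expected rewards of each arm
agent = GradientBanditsAgent(n_actions=3, alpha=0.1)

for _ in range(1000):
    action = agent.select_action()
    reward = np.random.normal(loc=true_values[action], scale=1.0)  # Noisy reward
    agent.update(action, reward)

print(softmax(agent.H))  # Most of the probability mass should concentrate on the best arm

Over time, the preference of the arm with the highest expected reward grows, so most of the selection probability shifts toward it while the other arms are still occasionally explored.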
Additional Information
Gradient bandits have several interesting properties:
- Preference relativity: the absolute values of action preferences do not affect the action selection process; only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution (see the sketch after this list);
- Effect of the baseline in the update rule: although the update formula typically uses the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
- Impact of the step-size: the step-size should be tuned to the task at hand. A smaller step-size gives more stable learning, while a larger one accelerates the learning process.
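To illustrate preference relativity, a short sketch (reusing the softmax function from the sample code and hypothetical preference values) shows that adding a constant to every preference leaves the probabilities unchanged:

H = np.array([1.0, 2.0, 3.0])   # Hypothetical preferences
print(softmax(H))               # ~[0.090, 0.245, 0.665]
print(softmax(H + 100))         # Same distribution: only relative differences matter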
Summary
Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is the ability to balance exploration and exploitation naturally.