Gradient Bandits Algorithm
When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach: they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.
Preferences
Instead of maintaining action value estimates Q(a), gradient bandits maintain preference values H(a) for each action a. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:
$$P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}} = \pi_t(a)$$
where:
- H_t(a) is the preference for action a at time step t;
- P(A_t = a) is the probability of selecting action a at time step t;
- The denominator ensures that the probabilities sum to 1.
Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the argmax function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
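For instance, here is a minimal sketch of this step, assuming NumPy and a hypothetical preference vector for three actions:

import numpy as np

def softmax(x):
    """Convert a vector of preferences into a probability distribution."""
    z = x - np.max(x)              # Shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

H = np.array([2.0, 1.0, 0.0])      # Hypothetical preferences for three actions
print(softmax(H))                  # ~[0.665, 0.245, 0.090]: higher preference, higher probability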
Update Rule
After selecting an action Atβ at time t, the preference values are updated using the following rule:
$$H_{t+1}(A_t) \leftarrow H_t(A_t) + \alpha \left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$
$$H_{t+1}(a) \leftarrow H_t(a) - \alpha \left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \forall a \neq A_t$$
where:
- α is the step-size;
- R_t is the reward received at time step t;
- R̄_t is the average reward observed so far.
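As a quick worked example with hypothetical numbers: suppose α = 0.1, the current policy over three actions is π_t = (0.5, 0.3, 0.2), action 1 is selected (A_t = 1), the reward is R_t = 2, and the average reward so far is R̄_t = 1. The update then shifts the preferences as follows:
$$H_{t+1}(1) \leftarrow H_t(1) + 0.1 \cdot (2 - 1) \cdot (1 - 0.5) = H_t(1) + 0.05$$
$$H_{t+1}(2) \leftarrow H_t(2) - 0.1 \cdot (2 - 1) \cdot 0.3 = H_t(2) - 0.03$$
$$H_{t+1}(3) \leftarrow H_t(3) - 0.1 \cdot (2 - 1) \cdot 0.2 = H_t(3) - 0.02$$
Because the reward exceeded the baseline, the selected action's preference grows while the preferences of the other actions shrink.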
Intuition
On each time step, every preference is shifted slightly. The direction and size of the shift depend on how the received reward compares to the average reward:
- If the received reward is higher than average, the selected action becomes more preferred, while the other actions become less preferred;
- If the received reward is lower than average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.
Sample Code
import numpy as np

def softmax(x):
    """Simple softmax implementation"""
    x = x - np.max(x)  # Shift for numerical stability
    return np.exp(x) / np.sum(np.exp(x))

class GradientBanditsAgent:
    def __init__(self, n_actions, alpha):
        """Initialize an agent"""
        self.n_actions = n_actions    # Number of available actions
        self.alpha = alpha            # Step-size
        self.H = np.zeros(n_actions)  # Preferences
        self.reward_avg = 0           # Average reward
        self.t = 0                    # Time step counter

    def select_action(self):
        """Select an action according to the gradient bandits strategy"""
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Choose an action according to the probabilities
        return np.random.choice(self.n_actions, p=probs)

    def update(self, action, reward):
        """Update preferences"""
        # Increase the time step counter
        self.t += 1
        # Update the average reward incrementally
        self.reward_avg += (reward - self.reward_avg) / self.t
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Update preference values using stochastic gradient ascent
        self.H -= self.alpha * (reward - self.reward_avg) * probs
        self.H[action] += self.alpha * (reward - self.reward_avg)
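As a usage sketch, assuming the class above and a hypothetical 3-armed bandit with Gaussian rewards, the agent can be run like this:

true_values = np.array([0.2, 0.5, 1.0])  # Hypothetical expected rewards of each arm
agent = GradientBanditsAgent(n_actions=3, alpha=0.1)

for _ in range(1000):
    action = agent.select_action()
    reward = np.random.normal(loc=true_values[action], scale=1.0)  # Noisy reward
    agent.update(action, reward)

print(softmax(agent.H))  # Most of the probability mass should concentrate on the best arm

Over time, the preference of the arm with the highest expected reward grows, so most of the selection probability shifts toward it while the other arms are still occasionally explored.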
Additional Information
Gradient bandits have several interesting properties:
- Preference relativity: the absolute values of action preferences do not affect the action selection process; only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution (see the sketch after this list);
- Effect of the baseline in the update rule: although the update formula typically uses the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
- Impact of the step-size: the step-size should be tuned to the task at hand. A smaller step-size gives more stable learning, while a larger one accelerates the learning process.
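To illustrate preference relativity, a short sketch (reusing the softmax function from the sample code and hypothetical preference values) shows that adding a constant to every preference leaves the probabilities unchanged:

H = np.array([1.0, 2.0, 3.0])   # Hypothetical preferences
print(softmax(H))               # ~[0.090, 0.245, 0.665]
print(softmax(H + 100))         # Same distribution: only relative differences matter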
Summary
Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is the ability to balance exploration and exploitation naturally.