SARSA: On-Policy Learning
SARSA, which stands for State-Action-Reward-State-Action, is a popular reinforcement learning algorithm that learns action values from the actions the agent actually takes. Unlike Q-learning, which is off-policy and updates its value estimates using the maximum estimated action value in the next state, SARSA is an on-policy algorithm: it updates its Q-values using the reward and the next action that the agent actually selects under its current policy.
The SARSA update rule can be broken down as follows. At each step, the agent:
- Observes the current state (s) and selects an action (a) based on its current policy (such as epsilon-greedy);
- Executes the action, receives a reward (r), and observes the next state (s');
- Chooses the next action (a') from state s' using the same policy;
- Updates the Q-value for the state-action pair (s, a) using the reward, the next state, and the next action.
The update rule for SARSA is:
Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
where:
- α is the learning rate;
- γ is the discount factor;
- r is the reward received after taking action a in state s;
- Q(s, a) is the current estimate of the value of taking action a in state s;
- Q(s', a') is the value of the next state-action pair, as chosen by the current policy.
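To make the arithmetic concrete, here is a minimal sketch of a single SARSA update using hypothetical numbers (the learning rate, discount factor, reward, and Q-values below are illustrative assumptions, not values from the text):

```python
# A single SARSA update with illustrative (hypothetical) numbers
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

q_sa = 2.0      # current estimate Q(s, a)
reward = 1.0    # reward r received after taking a in s
q_next = 3.0    # Q(s', a') for the action the policy actually chose next

# The TD target bootstraps from the chosen next action, not the best one
td_target = reward + gamma * q_next    # 1.0 + 0.9 * 3.0 = 3.7
q_sa += alpha * (td_target - q_sa)     # 2.0 + 0.1 * (3.7 - 2.0)
print(q_sa)                            # ≈ 2.17
```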
The key feature of SARSA's on-policy nature is that it uses the action actually taken by the agent (which may not be the optimal action) to update its Q-values. This makes SARSA sensitive to the exploration strategy used by the agent, and the learned policy tends to reflect the same level of exploration or risk as the behavior policy.
```python
# SARSA pseudocode in Python (highlighting the on-policy update)
import numpy as np

def select_action(state, Q, epsilon):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

for episode in range(num_episodes):
    state = env.reset()
    action = select_action(state, Q, epsilon)  # Choose action using current policy
    done = False
    while not done:
        next_state, reward, done, info = env.step(action)
        next_action = select_action(next_state, Q, epsilon)  # Choose next action using current policy
        # SARSA update: use the action actually taken (on-policy)
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * Q[next_state, next_action] - Q[state, action]
        )
        state = next_state
        action = next_action
```
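This sketch assumes Q is a 2-D NumPy array of shape (num_states, num_actions) and that env follows the classic Gym-style interface, where step returns (next_state, reward, done, info); adjust the helper and the loop to match your own environment.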
Note: You might prefer SARSA over Q-learning when you want your agent's learned policy to match its actual behavior, especially in environments where following the current (possibly exploratory) policy is safer or more realistic than always assuming optimal actions. SARSA's on-policy nature makes it more conservative in risky environments, as it accounts for the agent's tendency to explore. A classic illustration is the cliff-walking gridworld, where SARSA learns a safer path away from the cliff while Q-learning learns the shorter but riskier path along its edge.
Here is a comparison of Q-learning and SARSA in terms of their update rules, policy types, and common use cases:
| Algorithm | Update Rule | Policy Type | Typical Use Cases |
|---|---|---|---|
| Q-learning | Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)] | Off-policy | When you want to learn the optimal policy, regardless of the agent's current behavior. Useful in deterministic or low-risk environments. |
| SARSA | Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)] | On-policy | When you want the agent to learn a policy that matches its actual behavior, especially in risky or highly exploratory environments. |
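Since the two rules differ only in their bootstrap target, the contrast fits in a few lines of code. This is a sketch that assumes the same tabular NumPy Q array and loop variables (state, action, reward, next_state, next_action) as the pseudocode above:

```python
# Off-policy (Q-learning): bootstrap from the greedy next action
q_learning_target = reward + gamma * np.max(Q[next_state])

# On-policy (SARSA): bootstrap from the action the policy actually took
sarsa_target = reward + gamma * Q[next_state, next_action]

# Both algorithms then apply the same incremental update; e.g., for SARSA:
Q[state, action] += alpha * (sarsa_target - Q[state, action])
```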
1. Which of the following best describes the difference between on-policy and off-policy learning?
2. Fill in the blanks to complete the SARSA update equation:
Q(s, a) ← Q(s, a) + α [r + γ _ _ _ − Q(s, a)]