SARSA: On-Policy Learning
SARSA, which stands for State-Action-Reward-State-Action, is a popular reinforcement learning algorithm that learns action values from the actions the agent actually takes. Unlike Q-learning, which is off-policy and updates its value estimates using the maximum estimated action value in the next state, SARSA is an on-policy algorithm: it updates its Q-values using the reward and the next action that the agent actually selects under its current policy.
The SARSA update rule can be broken down as follows. At each step, the agent:
- Observes the current state (s) and selects an action (a) based on its current policy (such as epsilon-greedy);
- Executes the action, receives a reward (r), and observes the next state (s');
- Chooses the next action (a') from state s' using the same policy;
- Updates the Q-value for the state-action pair (s, a) using the reward, the next state, and the next action.
The update rule for SARSA is:
Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
where:
- α is the learning rate;
- γ is the discount factor;
- r is the reward received after taking action a in state s;
- Q(s, a) is the current estimate of the value of taking action a in state s;
- Q(s', a') is the value of the next state-action pair, as chosen by the current policy.
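To make the arithmetic concrete, here is a minimal sketch of a single SARSA update using hypothetical numbers (the learning rate, discount factor, reward, and Q-values below are illustrative assumptions, not values from the text):

```python
# A single SARSA update with illustrative (hypothetical) numbers
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

q_sa = 2.0      # current estimate Q(s, a)
reward = 1.0    # reward r received after taking a in s
q_next = 3.0    # Q(s', a') for the action the policy actually chose next

# The TD target bootstraps from the chosen next action, not the best one
td_target = reward + gamma * q_next    # 1.0 + 0.9 * 3.0 = 3.7
q_sa += alpha * (td_target - q_sa)     # 2.0 + 0.1 * (3.7 - 2.0)
print(q_sa)                            # ≈ 2.17
```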
The key feature of SARSA's on-policy nature is that it uses the action actually taken by the agent (which may not be the optimal action) to update its Q-values. This makes SARSA sensitive to the exploration strategy used by the agent, and the learned policy tends to reflect the same level of exploration or risk as the behavior policy.
```python
# SARSA pseudocode in Python (highlighting the on-policy update)
import numpy as np

def select_action(state, Q, epsilon):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

for episode in range(num_episodes):
    state = env.reset()
    action = select_action(state, Q, epsilon)  # Choose action using current policy
    done = False
    while not done:
        next_state, reward, done, info = env.step(action)
        next_action = select_action(next_state, Q, epsilon)  # Choose next action using current policy
        # SARSA update: use the action actually taken (on-policy)
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * Q[next_state, next_action] - Q[state, action]
        )
        state = next_state
        action = next_action
```
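This sketch assumes Q is a 2-D NumPy array of shape (num_states, num_actions) and that env follows the classic Gym-style interface, where step returns (next_state, reward, done, info); adjust the helper and the loop to match your own environment.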
Note: You might prefer SARSA over Q-learning when you want your agent's learned policy to match its actual behavior, especially in environments where following the current (possibly exploratory) policy is safer or more realistic than always assuming optimal actions. SARSA's on-policy nature makes it more conservative in risky environments, as it accounts for the agent's tendency to explore. A classic illustration is the cliff-walking gridworld, where SARSA learns a safer path away from the cliff while Q-learning learns the shorter but riskier path along its edge.
Here is a comparison of Q-learning and SARSA in terms of their update rules, policy types, and common use cases:
| Algorithm | Update Rule | Policy Type | Typical Use Cases |
|---|---|---|---|
| Q-learning | Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)] | Off-policy | When you want to learn the optimal policy, regardless of the agent's current behavior. Useful in deterministic or low-risk environments. |
| SARSA | Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)] | On-policy | When you want the agent to learn a policy that matches its actual behavior, especially in risky or highly exploratory environments. |
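Since the two rules differ only in their bootstrap target, the contrast fits in a few lines of code. This is a sketch that assumes the same tabular NumPy Q array and loop variables (state, action, reward, next_state, next_action) as the pseudocode above:

```python
# Off-policy (Q-learning): bootstrap from the greedy next action
q_learning_target = reward + gamma * np.max(Q[next_state])

# On-policy (SARSA): bootstrap from the action the policy actually took
sarsa_target = reward + gamma * Q[next_state, next_action]

# Both algorithms then apply the same incremental update; e.g., for SARSA:
Q[state, action] += alpha * (sarsa_target - Q[state, action])
```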
1. Which of the following best describes the difference between on-policy and off-policy learning?
2. Fill in the blanks to complete the SARSA update equation:
Q(s, a) ← Q(s, a) + α [r + γ _ _ _ − Q(s, a)]