Hands-On Classic RL Algorithms with Python

Implementing SARSA in Python


import numpy as np
import gymnasium as gym

# Initialize grid-world environment
env = gym.make("FrozenLake-v1", is_slippery=False)  # Deterministic grid-world

num_states = env.observation_space.n
num_actions = env.action_space.n

# Initialize Q-table for SARSA
Q = np.zeros((num_states, num_actions))

# Set SARSA hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate
episodes = 1000  # Number of training episodes


def epsilon_greedy(Q, state, epsilon):
    """Return a random action with probability epsilon, else the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.choice(Q.shape[1])
    else:
        return np.argmax(Q[state])

for episode in range(episodes):
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, epsilon)
    done = False

    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(Q, next_state, epsilon)

        # SARSA update: bootstrap on the action the policy will actually take.
        # Q[next_state] stays zero for terminal states (it is never updated),
        # so the bootstrap term vanishes on the final transition, as it should.
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        state = next_state
        action = next_action
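
Once training finishes, a quick sanity check is to roll out the greedy policy (epsilon set to 0) and see whether the agent reaches the goal. This is a minimal sketch that reuses the env and Q objects defined above:

# Roll out the greedy policy once to sanity-check training
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = int(np.argmax(Q[state]))  # Always exploit the learned Q-values
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Greedy rollout return: {total_reward}")  # 1.0 means the agent reached the goal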

In SARSA, the agent learns the value of the policy it actually follows, exploration steps included. The Q-table is updated using the action chosen by the current policy (which may be exploratory) rather than the greedy action used in Q-learning. This on-policy approach can make SARSA more robust in environments where exploratory actions can lead to poor outcomes, because the effects of exploration are folded directly into the learned values. In a grid world like FrozenLake, you may notice that SARSA's policy avoids risky paths more often than Q-learning's, especially when the environment contains negative rewards or traps.
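
To make the difference concrete, here are the two update rules side by side; the Q-learning line is shown only for comparison (it is not part of the SARSA loop above) and assumes the same variables:

# SARSA (on-policy): bootstraps on the action the policy actually takes next
Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

# Q-learning (off-policy): bootstraps on the greedy action, whatever is taken next
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])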

Note

To evaluate and visualize the performance of SARSA versus Q-learning, you can record the total rewards obtained in each episode and plot them using matplotlib. This allows you to compare how quickly each algorithm learns and how consistent their performance is over time.
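
For example, here is a minimal sketch of that idea, assuming you append each episode's total reward to an episode_rewards list inside the training loop above (that bookkeeping is not shown in the loop itself):

import matplotlib.pyplot as plt

# episode_rewards is assumed to hold one total reward per training episode,
# collected by appending inside the SARSA loop above.
window = 50  # Moving-average window to smooth the noisy per-episode returns
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")

plt.plot(smoothed, label="SARSA")
plt.xlabel("Episode")
plt.ylabel(f"Mean reward over {window}-episode window")
plt.legend()
plt.show()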


Which statement best describes how SARSA updates its policy in the grid-world environment?

Select the correct answer.
