Hands-On Classic RL Algorithms with Python

Implementing SARSA in Python


import numpy as np
import gymnasium as gym

# Initialize grid-world environment
env = gym.make("FrozenLake-v1", is_slippery=False)  # Deterministic grid-world

num_states = env.observation_space.n
num_actions = env.action_space.n

# Initialize Q-table for SARSA
Q = np.zeros((num_states, num_actions))

# Set SARSA hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate
episodes = 1000  # Number of training episodes

def epsilon_greedy(Q, state, epsilon):
    if np.random.rand() < epsilon:
        return np.random.choice(Q.shape[1])
    else:
        return np.argmax(Q[state])

for episode in range(episodes):
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, epsilon)
    done = False

    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Pick the next action with the same epsilon-greedy policy (on-policy step)
        next_action = epsilon_greedy(Q, next_state, epsilon)

        # SARSA update rule; terminal states are never updated, so their
        # Q-values stay zero and the target reduces to the immediate reward
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        state = next_state
        action = next_action
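Once training finishes, the learned behavior can be read straight off the Q-table: the greedy policy simply takes the highest-valued action in each state. A minimal sketch, using a small hypothetical Q-table in place of the trained one:

```python
import numpy as np

# Hypothetical 2-state, 4-action Q-table standing in for the trained one
Q = np.array([[0.10, 0.90, 0.00, 0.00],
              [0.00, 0.20, 0.70, 0.10]])

# The greedy policy picks the highest-valued action in each state
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)  # → [1 2]
```

To watch the trained agent act greedily, you can also run a few episodes with `epsilon = 0`, so that `epsilon_greedy` always returns `np.argmax(Q[state])`.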

In SARSA, the agent learns the value of the policy it is actually following, including the exploration steps. The Q-table is updated using the action chosen by the current policy (which may be exploratory), rather than the greedy action used in Q-learning. This on-policy approach can make SARSA more robust in environments where exploratory actions could lead to poor outcomes, as it directly incorporates the effects of exploration into the learning process. In grid-world, you may notice that SARSA's policy tends to avoid risky paths more often than Q-learning, especially when the environment has negative rewards or traps.
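The difference between the two update targets can be made concrete with a toy Q-table (all values here are hypothetical): SARSA bootstraps from the action the policy actually selected, while Q-learning bootstraps from the greedy action regardless of what was taken.

```python
import numpy as np

Q = np.array([[0.0, 0.5],
              [0.2, 0.8]])   # hypothetical 2-state, 2-action Q-table
gamma, reward = 0.99, 1.0
next_state = 1
next_action = 0              # exploratory action actually taken (not the greedy one)

# SARSA (on-policy): bootstraps from the action actually taken -> Q[1, 0] = 0.2
sarsa_target = reward + gamma * Q[next_state, next_action]

# Q-learning (off-policy): bootstraps from the greedy action -> max Q[1] = 0.8
q_learning_target = reward + gamma * np.max(Q[next_state])

print(round(sarsa_target, 3), round(q_learning_target, 3))  # → 1.198 1.792
```

Because the exploratory action has a lower value than the greedy one, SARSA's target is smaller: the cost of exploration is baked into the values it learns.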

Note

To evaluate and visualize the performance of SARSA versus Q-learning, you can record the total rewards obtained in each episode and plot them using matplotlib. This allows you to compare how quickly each algorithm learns and how consistent their performance is over time.
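A minimal sketch of that comparison (the reward curves below are synthetic placeholders; in practice you would append each episode's total reward inside the training loop). A sliding-window average smooths the noisy per-episode returns before plotting:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure saves to a file
import matplotlib.pyplot as plt

def moving_average(x, window=50):
    """Smooth a reward curve with a simple sliding-window mean."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

# Synthetic stand-ins for the per-episode rewards you would record
rng = np.random.default_rng(0)
sarsa_rewards = np.clip(np.linspace(0, 1, 1000) + rng.normal(0, 0.2, 1000), 0, 1)
q_rewards = np.clip(np.linspace(0, 1, 1000) + rng.normal(0, 0.2, 1000), 0, 1)

plt.plot(moving_average(sarsa_rewards), label="SARSA")
plt.plot(moving_average(q_rewards), label="Q-learning")
plt.xlabel("Episode")
plt.ylabel("Smoothed total reward")
plt.legend()
plt.savefig("sarsa_vs_qlearning.png")
```

The smoothed curves make it easy to see both how quickly each algorithm's returns rise and how much they fluctuate from episode to episode.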


Which statement best describes how SARSA updates its policy in the grid-world environment?



Section 1. Chapter 6
