Hands-On Classic RL Algorithms with Python

Implementing SARSA in Python


import numpy as np
import gymnasium as gym

# Initialize grid-world environment
env = gym.make("FrozenLake-v1", is_slippery=False)  # Deterministic grid-world

num_states = env.observation_space.n
num_actions = env.action_space.n

# Initialize Q-table for SARSA
Q = np.zeros((num_states, num_actions))

# Set SARSA hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate
episodes = 1000  # Number of training episodes


def epsilon_greedy(Q, state, epsilon):
    """Return a random action with probability epsilon, else the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.choice(Q.shape[1])
    else:
        return np.argmax(Q[state])

for episode in range(episodes):
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, epsilon)
    done = False

    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(Q, next_state, epsilon)

        # SARSA update: bootstrap on the action the policy will actually take.
        # Q[next_state] stays zero for terminal states (it is never updated),
        # so the bootstrap term vanishes on the final transition, as it should.
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        state = next_state
        action = next_action
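
Once training finishes, a quick sanity check is to roll out the greedy policy (epsilon set to 0) and see whether the agent reaches the goal. This is a minimal sketch that reuses the env and Q objects defined above:

# Roll out the greedy policy once to sanity-check training
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = int(np.argmax(Q[state]))  # Always exploit the learned Q-values
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Greedy rollout return: {total_reward}")  # 1.0 means the agent reached the goal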

In SARSA, the agent learns the value of the policy it actually follows, exploration steps included. The Q-table is updated using the action chosen by the current policy (which may be exploratory) rather than the greedy action used in Q-learning. This on-policy approach can make SARSA more robust in environments where exploratory actions can lead to poor outcomes, because the effects of exploration are folded directly into the learned values. In a grid world like FrozenLake, you may notice that SARSA's policy avoids risky paths more often than Q-learning's, especially when the environment contains negative rewards or traps.
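
To make the difference concrete, here are the two update rules side by side; the Q-learning line is shown only for comparison (it is not part of the SARSA loop above) and assumes the same variables:

# SARSA (on-policy): bootstraps on the action the policy actually takes next
Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

# Q-learning (off-policy): bootstraps on the greedy action, whatever is taken next
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])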

Note

To evaluate and visualize the performance of SARSA versus Q-learning, you can record the total rewards obtained in each episode and plot them using matplotlib. This allows you to compare how quickly each algorithm learns and how consistent their performance is over time.
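
For example, here is a minimal sketch of that idea, assuming you append each episode's total reward to an episode_rewards list inside the training loop above (that bookkeeping is not shown in the loop itself):

import matplotlib.pyplot as plt

# episode_rewards is assumed to hold one total reward per training episode,
# collected by appending inside the SARSA loop above.
window = 50  # Moving-average window to smooth the noisy per-episode returns
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")

plt.plot(smoothed, label="SARSA")
plt.xlabel("Episode")
plt.ylabel(f"Mean reward over {window}-episode window")
plt.legend()
plt.show()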


Which statement best describes how SARSA updates its policy in the grid-world environment?

Select the correct answer.
