Implementing Q-learning in Python | Classic RL Algorithms: Q-learning & SARSA
Hands-On Classic RL Algorithms with Python

Implementing Q-learning in Python

The epsilon-greedy strategy is a simple but effective way to balance exploration and exploitation in reinforcement learning. At each step, the agent selects a random action with probability epsilon (exploration), and the best-known action with probability 1 - epsilon (exploitation). This approach prevents the agent from getting stuck with suboptimal actions and encourages discovering new, potentially better strategies.
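In isolation, the selection rule can be sketched as a small helper function. This is a minimal sketch; the `epsilon_greedy` name and the demo Q-table are illustrative, not part of the lesson's code:

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randint(0, n_actions - 1)  # explore
    return int(np.argmax(Q[state]))              # exploit

# Tiny demo: with epsilon = 0 the choice is always greedy
Q_demo = np.array([[0.1, 0.5, 0.2, 0.0]])
print(epsilon_greedy(Q_demo, 0, 0.0, 4))  # → 1
```

Setting epsilon to 0 disables exploration entirely, while epsilon of 1 makes the agent act fully at random; values like 0.1 to 0.3 are common starting points.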

import numpy as np

# Define grid-world dimensions
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4  # up, down, left, right

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Define rewards for each state (simple example)
rewards = np.full((n_rows, n_cols), -1)
rewards[3, 3] = 10  # goal state at bottom-right corner

# Function to convert (row, col) to state index
def to_state(row, col):
    return row * n_cols + col
import random

# Q-learning parameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.2    # Exploration rate
episodes = 500

for episode in range(episodes):
    row, col = 0, 0  # Start at top-left corner
    done = False

    while not done:
        state = to_state(row, col)

        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, n_actions - 1)
        else:
            action = np.argmax(Q[state])

        # Take action
        if action == 0:    # up
            next_row, next_col = max(row - 1, 0), col
        elif action == 1:  # down
            next_row, next_col = min(row + 1, n_rows - 1), col
        elif action == 2:  # left
            next_row, next_col = row, max(col - 1, 0)
        else:              # right
            next_row, next_col = row, min(col + 1, n_cols - 1)

        reward = rewards[next_row, next_col]
        next_state = to_state(next_row, next_col)

        # Q-value update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        row, col = next_row, next_col

        if (row, col) == (n_rows - 1, n_cols - 1):
            done = True
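Once training finishes, one way to sanity-check the Q-table is to roll out greedy actions from the start state and inspect the resulting path. The sketch below is self-contained, so it substitutes a hand-made Q-table (preferring "down" until the bottom row, then "right") for the trained one; the `greedy_rollout` helper is illustrative:

```python
import numpy as np

n_rows, n_cols = 4, 4
n_actions = 4

def to_state(row, col):
    return row * n_cols + col

def greedy_rollout(Q, max_steps=50):
    """Follow argmax actions from the start state; return the visited cells."""
    row, col = 0, 0
    path = [(row, col)]
    for _ in range(max_steps):
        if (row, col) == (n_rows - 1, n_cols - 1):
            break  # reached the goal
        action = int(np.argmax(Q[to_state(row, col)]))
        if action == 0:    # up
            row = max(row - 1, 0)
        elif action == 1:  # down
            row = min(row + 1, n_rows - 1)
        elif action == 2:  # left
            col = max(col - 1, 0)
        else:              # right
            col = min(col + 1, n_cols - 1)
        path.append((row, col))
    return path

# Hand-made Q-table: prefer "down" above the bottom row, then "right"
Q_demo = np.zeros((n_rows * n_cols, n_actions))
for r in range(n_rows):
    for c in range(n_cols):
        Q_demo[to_state(r, c), 1 if r < n_rows - 1 else 3] = 1.0

print(greedy_rollout(Q_demo))  # → [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
```

With the real trained Q-table, a short path ending at the goal indicates the agent has learned a sensible policy; a path that hits `max_steps` suggests it is looping.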

Note: To visualize the learned policy, extract the action with the highest Q-value for each state from the Q-table. You can display arrows or symbols on the grid to represent the optimal action at each position, helping you understand the agent's preferred path to the goal.
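The arrow display described above can be sketched as follows. To keep the snippet self-contained it uses a hand-made Q-table whose greedy action is always "right"; with the Q-table produced by the training loop, the arrows would instead trace the learned route to the goal. The `policy_grid` helper is an illustrative name, not part of the lesson's code:

```python
import numpy as np

n_rows, n_cols = 4, 4
arrows = ['^', 'v', '<', '>']  # same action order as training: up, down, left, right

def policy_grid(Q):
    """Return one string per grid row showing the greedy action; 'G' marks the goal."""
    rows = []
    for r in range(n_rows):
        syms = []
        for c in range(n_cols):
            if (r, c) == (n_rows - 1, n_cols - 1):
                syms.append('G')
            else:
                syms.append(arrows[int(np.argmax(Q[r * n_cols + c]))])
        rows.append(' '.join(syms))
    return rows

# Demo Q-table whose greedy action is always "right" (action 3)
Q_right = np.zeros((n_rows * n_cols, 4))
Q_right[:, 3] = 1.0

for line in policy_grid(Q_right):
    print(line)
```

For the demo table this prints four rows of `>` symbols with `G` in the bottom-right corner.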

Which of the following best describes the epsilon-greedy strategy in Q-learning?
