Implementing Q-learning in Python
The epsilon-greedy strategy is a simple but effective way to balance exploration and exploitation in reinforcement learning. At each step, the agent selects a random action with probability epsilon (exploration), and the best-known action with probability 1 - epsilon (exploitation). This approach prevents the agent from getting stuck with suboptimal actions and encourages discovering new, potentially better strategies.
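As a minimal sketch of this selection rule on its own (the `epsilon_greedy` helper name and the toy Q-values are illustrative; the full example below inlines the same logic):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Rough check: with epsilon = 0.2, the greedy action (index 1 here)
# should be chosen about 80% of the time, plus a share of random picks
q = [1.0, 5.0, 2.0]
counts = [0, 0, 0]
for _ in range(10_000):
    counts[epsilon_greedy(q, 0.2)] += 1
print(counts)
```

The occasional random picks keep the other actions' Q-values from going permanently unvisited.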
import numpy as np
# Define grid-world dimensions
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4 # up, down, left, right
# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))
# Define rewards for each state (simple example)
rewards = np.full((n_rows, n_cols), -1)
rewards[3, 3] = 10 # goal state at bottom-right corner
# Function to convert (row, col) to state index
def to_state(row, col):
    return row * n_cols + col
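To make the row-major mapping concrete, here is how a few cells flatten to state indices (the values follow directly from the formula above):

```python
n_cols = 4  # grid width from the example above

def to_state(row, col):
    # Row-major flattening: each full row contributes n_cols indices
    return row * n_cols + col

print(to_state(0, 0))  # top-left corner -> 0
print(to_state(1, 2))  # -> 6
print(to_state(3, 3))  # goal cell -> 15
```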
import random
# Q-learning parameters
alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
epsilon = 0.2 # Exploration rate
episodes = 500
for episode in range(episodes):
    row, col = 0, 0  # Start at top-left corner
    done = False
    while not done:
        state = to_state(row, col)
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, n_actions - 1)
        else:
            action = np.argmax(Q[state])
        # Take action
        if action == 0:  # up
            next_row, next_col = max(row - 1, 0), col
        elif action == 1:  # down
            next_row, next_col = min(row + 1, n_rows - 1), col
        elif action == 2:  # left
            next_row, next_col = row, max(col - 1, 0)
        else:  # right
            next_row, next_col = row, min(col + 1, n_cols - 1)
        reward = rewards[next_row, next_col]
        next_state = to_state(next_row, next_col)
        # Q-value update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        row, col = next_row, next_col
        if (row, col) == (n_rows - 1, n_cols - 1):
            done = True
Note: To visualize the learned policy, extract the action with the highest Q-value for each state from the Q-table. You can display arrows or symbols on the grid to represent the optimal action at each position, helping you understand the agent's preferred path to the goal.
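One possible way to render that greedy policy, assuming the 4x4 grid and the action ordering from the code above (the `policy_grid` helper and the arrow symbols are illustrative choices, not part of the original code):

```python
import numpy as np

# Assumed action order, matching the training loop above:
# 0 = up, 1 = down, 2 = left, 3 = right
arrows = ['↑', '↓', '←', '→']

def policy_grid(Q, n_rows=4, n_cols=4):
    """Render the greedy policy as one text line per grid row."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            if (row, col) == (n_rows - 1, n_cols - 1):
                cells.append('G')  # goal state
            else:
                state = row * n_cols + col
                cells.append(arrows[int(np.argmax(Q[state]))])
        lines.append(' '.join(cells))
    return lines

# Example with an untrained (all-zero) table: argmax defaults to
# action 0, so every non-goal cell shows '↑'
for line in policy_grid(np.zeros((16, 4))):
    print(line)
```

After training, calling `policy_grid(Q)` on the learned table should show arrows funneling toward the bottom-right goal cell.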