Implementing Q-learning in Python
The epsilon-greedy strategy is a simple but effective way to balance exploration and exploitation in reinforcement learning. At each step, the agent selects a random action with probability epsilon (exploration), and the best-known action with probability 1 - epsilon (exploitation). This approach prevents the agent from getting stuck with suboptimal actions and encourages discovering new, potentially better strategies.
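In isolation, the selection rule looks like this (a minimal sketch; the helper name select_action is illustrative, and the same logic appears inside the training loop below):

import random
import numpy as np

def select_action(Q, state, epsilon, n_actions):
    # Explore: pick a uniformly random action with probability epsilon
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, n_actions - 1)
    # Exploit: pick the action with the highest current Q-value
    return int(np.argmax(Q[state]))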
import numpy as np

# Define grid-world dimensions
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4  # up, down, left, right

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Define rewards for each state (simple example)
rewards = np.full((n_rows, n_cols), -1)
rewards[3, 3] = 10  # goal state at bottom-right corner

# Function to convert (row, col) to state index
def to_state(row, col):
    return row * n_cols + col
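The conversion uses row-major indexing: for example, to_state(0, 0) returns 0 (the start state) and to_state(3, 3) returns 15 (the goal state), since 3 * 4 + 3 = 15.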
import random

# Q-learning parameters
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.2  # Exploration rate
episodes = 500

for episode in range(episodes):
    row, col = 0, 0  # Start at top-left corner
    done = False
    while not done:
        state = to_state(row, col)

        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, n_actions - 1)
        else:
            action = np.argmax(Q[state])

        # Take action
        if action == 0:    # up
            next_row, next_col = max(row - 1, 0), col
        elif action == 1:  # down
            next_row, next_col = min(row + 1, n_rows - 1), col
        elif action == 2:  # left
            next_row, next_col = row, max(col - 1, 0)
        else:              # right
            next_row, next_col = row, min(col + 1, n_cols - 1)

        reward = rewards[next_row, next_col]
        next_state = to_state(next_row, next_col)

        # Q-value update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        row, col = next_row, next_col
        if (row, col) == (n_rows - 1, n_cols - 1):
            done = True
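For reference, the update inside the loop is the standard Q-learning rule Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') - Q(s, a)), where s' is the state reached after taking action a, r is the reward for that transition, α is the learning rate, and γ is the discount factor. Each episode follows this loop until the agent reaches the goal cell.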
Note: To visualize the learned policy, extract the action with the highest Q-value for each state from the Q-table. You can display arrows or symbols on the grid to represent the optimal action at each position, helping you understand the agent's preferred path to the goal.
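A minimal sketch of that idea, reusing the Q table and to_state function defined above (the arrow symbols and the 'G' marker for the goal are arbitrary display choices):

# Map action indices to direction symbols: 0=up, 1=down, 2=left, 3=right
arrows = ['^', 'v', '<', '>']

for row in range(n_rows):
    line = []
    for col in range(n_cols):
        if (row, col) == (n_rows - 1, n_cols - 1):
            line.append('G')  # goal state
        else:
            best_action = int(np.argmax(Q[to_state(row, col)]))
            line.append(arrows[best_action])
    print(' '.join(line))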