Implementing Q-learning in Python
The epsilon-greedy strategy is a simple but effective way to balance exploration and exploitation in reinforcement learning. At each step, the agent selects a random action with probability epsilon (exploration), and the best-known action with probability 1 - epsilon (exploitation). This approach prevents the agent from getting stuck with suboptimal actions and encourages discovering new, potentially better strategies.
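In isolation, the selection rule looks like this (a minimal sketch; the helper name select_action is illustrative, and the same logic appears inside the training loop below):

import random
import numpy as np

def select_action(Q, state, epsilon, n_actions):
    # Explore: pick a uniformly random action with probability epsilon
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, n_actions - 1)
    # Exploit: pick the action with the highest current Q-value
    return int(np.argmax(Q[state]))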
import numpy as np

# Define grid-world dimensions
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4  # up, down, left, right

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Define rewards for each state (simple example)
rewards = np.full((n_rows, n_cols), -1)
rewards[3, 3] = 10  # goal state at bottom-right corner

# Function to convert (row, col) to state index
def to_state(row, col):
    return row * n_cols + col
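The conversion uses row-major indexing: for example, to_state(0, 0) returns 0 (the start state) and to_state(3, 3) returns 15 (the goal state), since 3 * 4 + 3 = 15.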
import random

# Q-learning parameters
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.2  # Exploration rate
episodes = 500

for episode in range(episodes):
    row, col = 0, 0  # Start at top-left corner
    done = False
    while not done:
        state = to_state(row, col)

        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, n_actions - 1)
        else:
            action = np.argmax(Q[state])

        # Take action
        if action == 0:    # up
            next_row, next_col = max(row - 1, 0), col
        elif action == 1:  # down
            next_row, next_col = min(row + 1, n_rows - 1), col
        elif action == 2:  # left
            next_row, next_col = row, max(col - 1, 0)
        else:              # right
            next_row, next_col = row, min(col + 1, n_cols - 1)

        reward = rewards[next_row, next_col]
        next_state = to_state(next_row, next_col)

        # Q-value update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        row, col = next_row, next_col
        if (row, col) == (n_rows - 1, n_cols - 1):
            done = True
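For reference, the update inside the loop is the standard Q-learning rule Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') - Q(s, a)), where s' is the state reached after taking action a, r is the reward for that transition, α is the learning rate, and γ is the discount factor. Each episode follows this loop until the agent reaches the goal cell.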
Note: To visualize the learned policy, extract the action with the highest Q-value for each state from the Q-table. You can display arrows or symbols on the grid to represent the optimal action at each position, helping you understand the agent's preferred path to the goal.
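A minimal sketch of that idea, reusing the Q table and to_state function defined above (the arrow symbols and the 'G' marker for the goal are arbitrary display choices):

# Map action indices to direction symbols: 0=up, 1=down, 2=left, 3=right
arrows = ['^', 'v', '<', '>']

for row in range(n_rows):
    line = []
    for col in range(n_cols):
        if (row, col) == (n_rows - 1, n_cols - 1):
            line.append('G')  # goal state
        else:
            best_action = int(np.argmax(Q[to_state(row, col)]))
            line.append(arrows[best_action])
    print(' '.join(line))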