Hands-On Classic RL Algorithms with Python

Q-learning: Theory and Algorithm


Q-learning is a foundational algorithm in reinforcement learning that allows an agent to learn the optimal actions to take in a given environment by updating a table of action values, known as the Q-table. The core of Q-learning is its update rule, which is derived from the Bellman equation. This equation expresses the relationship between the value of a state-action pair and the expected rewards from taking that action, followed by the best possible future actions. Q-learning is a model-free, temporal difference method: the agent repeatedly updates its estimates of Q-values based on new experiences, gradually improving its policy over time.
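
In code, the Q-table is just a 2-D array with one row per state and one column per action. A minimal sketch (the sizes and indices below are illustrative, not tied to any particular environment):

import numpy as np

num_states, num_actions = 4, 2  # hypothetical discrete environment
Q = np.zeros((num_states, num_actions))

value = Q[2, 1]                # estimated return for action 1 in state 2
best_action = np.argmax(Q[2])  # greedy action in state 2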

Bellman Equation for Q-values:

Q(s, a) = r + \gamma \max_{a'} Q(s', a')

This equation states that the Q-value for a state s and action a is equal to the immediate reward r plus the discounted maximum Q-value of the next state s' over all possible actions a'. The discount factor gamma controls how strongly future rewards are weighted. The Bellman equation is fundamental in Q-learning, as it defines how the agent should update its Q-values to reflect both immediate and expected future rewards. By repeatedly applying this update, Q-learning lets the agent learn action values through a sample-based form of value iteration, gradually converging on the optimal policy.
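
To make this concrete, here is a minimal sketch that computes the Bellman target for a single transition; the reward, discount factor, and next-state Q-values are made-up illustrative numbers:

import numpy as np

gamma = 0.9                         # discount factor (example value)
reward = 1.0                        # immediate reward r (example value)
Q_next = np.array([0.2, 0.5, 0.1])  # hypothetical Q(s', a') for each a'

# Bellman target: immediate reward plus the discounted best future value
target = reward + gamma * np.max(Q_next)  # 1.0 + 0.9 * 0.5 = 1.45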

The Q-learning update rule is as follows: after observing a transition from state s to state s' by taking action a and receiving reward r, the Q-value is updated by moving it closer to the sum of the immediate reward and the maximum estimated future reward, discounted by a factor gamma. The learning rate alpha determines how much new information overrides old information. This process allows the agent to iteratively refine its understanding of which actions yield the highest long-term rewards.
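
Written out in full, this is:

Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]

The bracketed term is the temporal difference error: the gap between the current estimate Q(s, a) and the Bellman target.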

# Q-learning algorithm (tabular, with epsilon-greedy exploration)
import gymnasium as gym
import numpy as np

# Example setup: any Gymnasium environment with discrete states and
# actions works; FrozenLake-v1 and the hyperparameter values below
# are illustrative choices.
env = gym.make("FrozenLake-v1")
num_states = env.observation_space.n
num_actions = env.action_space.n
num_episodes = 1000
alpha = 0.1     # learning rate
gamma = 0.99    # discount factor
epsilon = 0.1   # exploration rate

# Initialize Q-table with zeros
Q = np.zeros((num_states, num_actions))

for episode in range(num_episodes):
    state, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
    done = False
    while not done:
        # Choose action: epsilon-greedy strategy
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit

        # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update rule: move Q(s, a) toward the TD target
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
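
After training, the policy the agent has learned can be read directly off the Q-table by taking the greedy action in each state (a short sketch, reusing the Q array from the loop above):

# Greedy policy: for each state, pick the action with the highest Q-value
policy = np.argmax(Q, axis=1)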
Note:
The learning rate (alpha) controls how quickly the Q-values are updated with new information. A high learning rate means recent experiences have more influence, but can cause instability. The discount factor (gamma) determines the importance of future rewards. A higher discount factor encourages the agent to consider long-term rewards, while a lower value focuses the agent on immediate rewards.
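
A single worked update makes the roles of alpha and gamma concrete (all numbers here are illustrative):

alpha, gamma = 0.5, 0.9   # example learning rate and discount factor

q_old = 2.0               # current estimate Q(s, a)
reward = 1.0              # observed immediate reward r
max_q_next = 3.0          # max over a' of Q(s', a')

target = reward + gamma * max_q_next      # 1.0 + 0.9 * 3.0 = 3.7
q_new = q_old + alpha * (target - q_old)  # 2.0 + 0.5 * 1.7 = 2.85

With alpha = 1.0 the old estimate would be replaced entirely by the target; with alpha near 0 it would barely move.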

1. Which of the following best describes the Q-learning update rule?

2. Fill in the blanks to complete the Q-learning update equation:

Q(s, a) = Q(s, a) + ____ * [ ____ + ____ * max Q(s', a') - Q(s, a) ]
