Hands-On Classic RL Algorithms with Python

Q-learning: Theory and Algorithm


Q-learning is a foundational algorithm in reinforcement learning that allows an agent to learn the optimal actions to take in a given environment by updating a table of action values, known as the Q-table. The core of Q-learning is its update rule, which is derived from the Bellman equation. This equation expresses the relationship between the value of a state-action pair and the expected rewards from taking that action, followed by the best possible future actions. Q-learning is a model-free, temporal difference method: the agent repeatedly updates its estimates of Q-values based on new experiences, gradually improving its policy over time.
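
In code, the Q-table is just a 2-D array with one row per state and one column per action. A minimal sketch (the sizes and indices below are illustrative, not tied to any particular environment):

import numpy as np

num_states, num_actions = 4, 2  # hypothetical discrete environment
Q = np.zeros((num_states, num_actions))

value = Q[2, 1]                # estimated return for action 1 in state 2
best_action = np.argmax(Q[2])  # greedy action in state 2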

Bellman Equation for Q-values:

Q(s, a) = r + \gamma \max_{a'} Q(s', a')

This equation states that the Q-value for a state s and action a is equal to the immediate reward r plus the discounted maximum Q-value of the next state s' over all possible actions a'. The discount factor gamma controls how strongly future rewards are weighted. The Bellman equation is fundamental in Q-learning, as it defines how the agent should update its Q-values to reflect both immediate and expected future rewards. By repeatedly applying this update, Q-learning lets the agent learn action values through a sample-based form of value iteration, gradually converging on the optimal policy.
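
To make this concrete, here is a minimal sketch that computes the Bellman target for a single transition; the reward, discount factor, and next-state Q-values are made-up illustrative numbers:

import numpy as np

gamma = 0.9                         # discount factor (example value)
reward = 1.0                        # immediate reward r (example value)
Q_next = np.array([0.2, 0.5, 0.1])  # hypothetical Q(s', a') for each a'

# Bellman target: immediate reward plus the discounted best future value
target = reward + gamma * np.max(Q_next)  # 1.0 + 0.9 * 0.5 = 1.45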

The Q-learning update rule is as follows: after observing a transition from state s to state s' by taking action a and receiving reward r, the Q-value is updated by moving it closer to the sum of the immediate reward and the maximum estimated future reward, discounted by a factor gamma. The learning rate alpha determines how much new information overrides old information. This process allows the agent to iteratively refine its understanding of which actions yield the highest long-term rewards.
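
Written out in full, this is:

Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]

The bracketed term is the temporal difference error: the gap between the current estimate Q(s, a) and the Bellman target.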

# Q-learning algorithm (tabular, with epsilon-greedy exploration)
import gymnasium as gym
import numpy as np

# Example setup: any Gymnasium environment with discrete states and
# actions works; FrozenLake-v1 and the hyperparameter values below
# are illustrative choices.
env = gym.make("FrozenLake-v1")
num_states = env.observation_space.n
num_actions = env.action_space.n
num_episodes = 1000
alpha = 0.1     # learning rate
gamma = 0.99    # discount factor
epsilon = 0.1   # exploration rate

# Initialize Q-table with zeros
Q = np.zeros((num_states, num_actions))

for episode in range(num_episodes):
    state, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
    done = False
    while not done:
        # Choose action: epsilon-greedy strategy
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit

        # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update rule: move Q(s, a) toward the TD target
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
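
After training, the policy the agent has learned can be read directly off the Q-table by taking the greedy action in each state (a short sketch, reusing the Q array from the loop above):

# Greedy policy: for each state, pick the action with the highest Q-value
policy = np.argmax(Q, axis=1)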
Note:
The learning rate (alpha) controls how quickly the Q-values are updated with new information. A high learning rate means recent experiences have more influence, but can cause instability. The discount factor (gamma) determines the importance of future rewards. A higher discount factor encourages the agent to consider long-term rewards, while a lower value focuses the agent on immediate rewards.
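
A single worked update makes the roles of alpha and gamma concrete (all numbers here are illustrative):

alpha, gamma = 0.5, 0.9   # example learning rate and discount factor

q_old = 2.0               # current estimate Q(s, a)
reward = 1.0              # observed immediate reward r
max_q_next = 3.0          # max over a' of Q(s', a')

target = reward + gamma * max_q_next      # 1.0 + 0.9 * 3.0 = 3.7
q_new = q_old + alpha * (target - q_old)  # 2.0 + 0.5 * 1.7 = 2.85

With alpha = 1.0 the old estimate would be replaced entirely by the target; with alpha near 0 it would barely move.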

1. Which of the following best describes the Q-learning update rule?

2. Fill in the blanks to complete the Q-learning update equation:

Q(s, a) = Q(s, a) + ____ * [ ____ + ____ * max Q(s', a') - Q(s, a) ]
