Q-learning: Theory and Algorithm | Classic RL Algorithms: Q-learning & SARSA
Hands-On Classic RL Algorithms with Python

Q-learning: Theory and Algorithm


Q-learning is a foundational algorithm in reinforcement learning that allows an agent to learn the optimal actions to take in a given environment by updating a table of action values, known as the Q-table. The core of Q-learning is its update rule, which is derived from the Bellman equation. This equation expresses the relationship between the value of a state-action pair and the expected rewards from taking that action, followed by the best possible future actions. Q-learning is a model-free, temporal difference method: the agent repeatedly updates its estimates of Q-values based on new experiences, gradually improving its policy over time.
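The Q-table described above is just a 2-D array with one entry per state-action pair. As a minimal sketch (the sizes `num_states` and `num_actions` below are illustrative, not from a specific environment):

```python
import numpy as np

# Hypothetical sizes for illustration
num_states, num_actions = 4, 2

# One Q-value per (state, action) pair, initialized to zero
Q = np.zeros((num_states, num_actions))

# The greedy policy simply picks the highest-valued action in each state
policy = np.argmax(Q, axis=1)
print(Q.shape, policy.shape)  # (4, 2) (4,)
```

As learning updates fill in the table, the greedy policy extracted this way improves along with the Q-value estimates.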

Bellman Equation for Q-values:

Q(s, a) = r + \gamma \max_{a'} Q(s', a')

This equation states that the Q-value for a state s and action a is equal to the immediate reward r plus the discounted maximum Q-value of the next state s' over all possible actions a'. The discount factor gamma controls how much future rewards are considered. The Bellman equation is fundamental in Q-learning, as it defines how the agent should update its Q-values to reflect both immediate and expected future rewards. By repeatedly applying this update, Q-learning enables the agent to learn the value of actions through value iteration, gradually converging on the optimal policy.

The Q-learning update rule is as follows: after observing a transition from state s to state s' by taking action a and receiving reward r, the Q-value is updated by moving it closer to the sum of the immediate reward and the maximum estimated future reward, discounted by a factor gamma. The learning rate alpha determines how much new information overrides old information. This process allows the agent to iteratively refine its understanding of which actions yield the highest long-term rewards.
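A single update step can be worked through by hand. The table sizes and numbers below are illustrative, not taken from the lesson's environment:

```python
import numpy as np

# One Q-learning update with assumed, illustrative values
alpha, gamma = 0.1, 0.9
Q = np.zeros((2, 2))               # tiny Q-table: 2 states, 2 actions
Q[1] = [0.5, 1.0]                  # assumed estimates for the next state s' = 1

s, a, r, s_next = 0, 0, 2.0, 1
td_target = r + gamma * np.max(Q[s_next])   # 2.0 + 0.9 * 1.0 = 2.9
Q[s, a] += alpha * (td_target - Q[s, a])    # moves Q(s, a) from 0.0 toward 2.9
print(Q[s, a])                              # approximately 0.29
```

With alpha = 0.1, the estimate moves only one tenth of the way toward the target, which is what makes the process stable under noisy rewards.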

# Q-learning algorithm
import numpy as np

# Initialize Q-table with zeros
Q = np.zeros((num_states, num_actions))

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Choose action: epsilon-greedy strategy
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])  # Exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update rule
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
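To see the loop above converge end to end, here is a self-contained run on a toy environment. The 5-state chain below (`ChainEnv`) is a hypothetical example, not part of the course: the agent starts in state 0 and earns reward 1 only on reaching state 4, so the optimal action everywhere is "move right".

```python
import numpy as np

# Hypothetical toy environment: a 5-state chain with a goal at the right end
class ChainEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        # action 0 = left, action 1 = right
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(self.n_states - 1, self.state + 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

rng = np.random.default_rng(0)
env = ChainEnv()
Q = np.zeros((env.n_states, 2))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if rng.uniform() < epsilon:
            action = int(rng.integers(2))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(action)
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

# Greedy policy in every non-terminal state should be "move right" (action 1)
print(np.argmax(Q[:-1], axis=1))
```

After 200 episodes the learned greedy policy chooses action 1 in all four non-terminal states, matching the optimal policy for this chain.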

Note:
The learning rate (alpha) controls how quickly the Q-values are updated with new information. A high learning rate means recent experiences have more influence, but can cause instability. The discount factor (gamma) determines the importance of future rewards. A higher discount factor encourages the agent to consider long-term rewards, while a lower value focuses the agent on immediate rewards.
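The effect of the learning rate is visible in a single update step. Using illustrative numbers (not from the lesson):

```python
# Same transition, two different learning rates (illustrative values)
gamma = 0.9
q_old, reward, max_next = 0.0, 1.0, 2.0
td_target = reward + gamma * max_next      # 1.0 + 0.9 * 2.0 = 2.8

for alpha in (0.1, 0.9):
    q_new = q_old + alpha * (td_target - q_old)
    print(alpha, q_new)
```

With alpha = 0.1 the estimate moves only slightly toward the target (to about 0.28), while alpha = 0.9 jumps almost all the way (to about 2.52); the large step adapts faster but amplifies noise from any single transition.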

1. Which of the following best describes the Q-learning update rule?

2. Fill in the blanks to complete the Q-learning update equation:

Q(s, a) = Q(s, a) + ____ * [ ____ + ____ * max Q(s', a') - Q(s, a) ]


