Exploration vs. Exploitation: The Multi-Armed Bandit Problem
In reinforcement learning, one of the most fundamental challenges is the exploration vs. exploitation dilemma. Imagine you are in a casino with a row of mysterious slot machines: these are the "multi-armed bandits." Each machine has its own hidden payout rate, but you don't know which one is best. Every time you pull an arm, you might get a reward, but the size and frequency of that reward are uncertain and unique to each machine.
Your goal is to win as much as possible. The dilemma: should you keep playing the machine that has given you the best payout so far (exploitation), or should you try other machines to discover if there's one with an even better reward (exploration)? This trade-off is at the heart of many real-world decision-making problems, from online recommendations to clinical trials.
# Simple Multi-Armed Bandit Simulation

import numpy as np

# Number of slot machines (arms)
n_arms = 3

# True payout probabilities for each arm (unknown to the agent)
true_means = np.array([0.2, 0.5, 0.75])

# Agent's estimated values for each arm
estimates = np.zeros(n_arms)

# Count of times each arm has been played
counts = np.zeros(n_arms)

# Number of rounds to play
n_rounds = 20

# Store rewards for analysis
rewards = []

# Epsilon-greedy strategy: explore with probability epsilon
epsilon = 0.2

for step in range(n_rounds):
    if np.random.rand() < epsilon:
        # Explore: choose a random arm
        arm = np.random.randint(n_arms)
    else:
        # Exploit: choose the best-known arm
        arm = np.argmax(estimates)

    # Simulate pulling the chosen arm
    reward = np.random.rand() < true_means[arm]
    reward = float(reward)
    rewards.append(reward)

    # Update counts and estimated value (average reward)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

    print(f"Round {step+1}: Pulled arm {arm}, reward {reward:.0f}, new estimates: {estimates}")

print(f"Total reward: {sum(rewards)}")
To handle the exploration vs. exploitation dilemma, you can use strategies that blend both behaviors. One common approach is to try each slot machine a few times to gather information (exploration), and then focus on the one that seems most rewarding (exploitation). The epsilon-greedy strategy, as shown in the code, randomly explores with a small probability (epsilon), but otherwise exploits the arm with the highest estimated value. This way, you avoid missing out on potentially better options while still taking advantage of what you have learned.
Other strategies include gradually reducing exploration over time (a decaying epsilon) or methods such as upper confidence bounds (UCB), which explicitly account for the uncertainty in each arm's estimate. The key is to find a balance that maximizes your total reward over many rounds, as the sketch below illustrates.
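As a minimal sketch of these two ideas, the snippet below reuses the setup from the simulation above (n_arms, true_means, and a Bernoulli reward) and compares a decaying-epsilon schedule with the UCB1 selection rule. The decay schedule epsilon = 1/(1+step) and the bonus term sqrt(2·ln(step)/counts) are common illustrative choices, not the only options.

# Sketch: two alternatives to a fixed exploration rate.
# Assumes the same bandit setup as above; the decay schedule
# and the UCB1 bonus are illustrative choices.
import numpy as np

n_arms = 3
true_means = np.array([0.2, 0.5, 0.75])

def pull(arm):
    # Bernoulli reward, as in the simulation above
    return float(np.random.rand() < true_means[arm])

# --- Variant 1: epsilon-greedy with a decaying epsilon ---
estimates = np.zeros(n_arms)
counts = np.zeros(n_arms)
for step in range(1000):
    epsilon = 1.0 / (1.0 + step)  # exploration rate shrinks over time
    if np.random.rand() < epsilon:
        arm = np.random.randint(n_arms)
    else:
        arm = np.argmax(estimates)
    reward = pull(arm)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
print("Decaying-epsilon estimates:", estimates)

# --- Variant 2: UCB1, which turns uncertainty into an exploration bonus ---
estimates = np.zeros(n_arms)
counts = np.zeros(n_arms)
for step in range(1000):
    if step < n_arms:
        arm = step  # play each arm once first
    else:
        # Estimated value plus a bonus that grows with uncertainty
        bonus = np.sqrt(2.0 * np.log(step) / counts)
        arm = np.argmax(estimates + bonus)
    reward = pull(arm)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
print("UCB1 estimates:", estimates)

With enough rounds, both variants concentrate their pulls on the arm with the highest true mean (0.75 here). UCB1 needs no tuned exploration rate, because its bonus term shrinks automatically as an arm is sampled more often.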
The exploration vs. exploitation dilemma and the multi-armed bandit problem are at the heart of reinforcement learning. An agent learns to make decisions by balancing two key behaviors: exploration, where it tries new actions to gather information, and exploitation, where it chooses the actions that have produced the best results so far. The goal is always to maximize the total reward received over time, not just in a single step.