Problem Introduction
The multi-armed bandit (MAB) problem is a well-known challenge in reinforcement learning, decision-making, and probability theory. It involves an agent repeatedly choosing among multiple actions (arms), each of which yields a reward drawn from a fixed probability distribution. The goal is to maximize the return over a fixed number of time steps.
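This setup can be simulated in a few lines of Python. The sketch below is illustrative rather than part of the original problem statement: the number of arms, the Gaussian reward model, and the hidden means drawn in the constructor are all assumptions made for the example.

```python
import numpy as np

class BanditEnvironment:
    """A k-armed bandit: each arm pays out from its own fixed distribution."""

    def __init__(self, n_arms=5, seed=0):
        self.rng = np.random.default_rng(seed)
        # True mean reward of each arm, hidden from the agent
        self.true_means = self.rng.normal(loc=0.0, scale=1.0, size=n_arms)

    def pull(self, arm):
        """Pull one arm and receive a noisy reward around its true mean."""
        return self.rng.normal(loc=self.true_means[arm], scale=1.0)

env = BanditEnvironment(n_arms=5)
reward = env.pull(2)  # reward sampled from arm 2's distribution
```

The agent only ever observes the sampled rewards, never the true means, which is exactly what makes the problem non-trivial.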
Origin of the Problem
The term "multi-armed bandit" originates from the analogy to a slot machine, often called a "one-armed bandit" due to its lever. In this scenario, imagine having multiple slot machines, or a slot machine that has multiple levers (arms), and each arm is associated with a distinct probability distribution for rewards. The goal is to maximize the return over a limited number of attempts by carefully choosing which lever to pull.
The Challenge
The MAB problem captures the challenge of balancing exploration and exploitation:
- Exploration: trying different arms to gather information about their payouts;
- Exploitation: pulling the arm that currently seems best to maximize immediate rewards.
A naive approach, playing a single arm repeatedly, might lead to suboptimal returns if a better arm exists but remains unexplored. Conversely, excessive exploration can waste resources on low-reward options.
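One common way to strike this balance is an epsilon-greedy rule: with a small probability the agent explores a random arm, and otherwise it exploits the arm with the highest estimated value. The sketch below is a minimal illustration; the Gaussian reward model, the epsilon value, and the example means are assumptions made for demonstration, not the only possible choices.

```python
import numpy as np

def epsilon_greedy(true_means, n_steps=1000, epsilon=0.1, seed=1):
    """Run epsilon-greedy on a bandit whose arms pay out around true_means."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)  # current estimate of each arm's mean reward
    counts = np.zeros(n_arms)     # how many times each arm has been pulled
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.integers(n_arms)       # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))  # exploit: pick the best-looking arm

        reward = rng.normal(true_means[arm], 1.0)  # noisy payout from that arm
        counts[arm] += 1
        # Incremental sample-average update of this arm's estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates

total, estimates = epsilon_greedy(true_means=[0.1, 0.5, 0.9, 0.3])
```

With enough steps, the estimates concentrate around the true means and the agent pulls the best arm most of the time, while still occasionally exploring.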
Real-World Applications
While originally framed in terms of gambling, the MAB problem appears in many fields:
- Online advertising: choosing the best ad to display based on user engagement;
- Clinical trials: testing multiple treatments to find the most effective one;
- Recommendation systems: serving the most relevant content to users.