Markov Decision Process
A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in which an agent interacts with an environment over time.
Reinforcement learning problems are often framed as MDPs, which provide a structured way to define the problem. MDPs describe the environment using four key components: states, actions, transitions, and rewards. These components work together under the Markov property, which ensures that the future state depends only on the current state and action, not on past states.
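Before looking at each component in detail, here is a minimal sketch of how an MDP can be bundled together in code. This is not a standard API, just one plausible way to structure it; the grid-cell state type and string actions are assumptions used for illustration.

```python
from typing import Callable, Dict, NamedTuple, Set, Tuple

# Hypothetical choices: a state is a cell in a grid, an action is a string.
State = Tuple[int, int]
Action = str

class MDP(NamedTuple):
    """One possible way to bundle the four components of an MDP."""
    states: Set[State]                                          # state space S
    actions: Set[Action]                                        # action space A
    transition: Callable[[State, Action], Dict[State, float]]  # p(s' | s, a)
    reward: Callable[[State, Action, State], float]            # R(s, a, s')
```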
The Four Components
State
A state s is a representation of the environment at a specific point in time. The set of all possible states is called the state space S.
A state is typically represented by a set of parameters that capture the relevant features of the environment, such as position, velocity, or rotation.
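For example, a state might be encoded as a small record of physical quantities. The fields below are one plausible choice for a cart-pole-style environment, not a fixed convention:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CartPoleState:
    """Hypothetical state encoding for a cart-pole-style environment."""
    position: float          # cart position along the track
    velocity: float          # cart velocity
    angle: float             # pole angle from vertical
    angular_velocity: float  # rate of change of the pole angle

s = CartPoleState(position=0.0, velocity=0.1, angle=0.02, angular_velocity=-0.05)
```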
Action
An action a is a decision or a move made by the agent to influence the environment. The set of all possible actions is called the action space A.
The set of possible actions usually depends on the current state.
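One simple way to model this dependence is a function that maps a state to the actions available in it. The grid-navigation rules below are a hypothetical example (the grid size and action names are assumptions, not part of any standard):

```python
GRID_SIZE = 3  # hypothetical 3x3 grid world

def available_actions(state):
    """Return the moves that keep the agent inside the grid."""
    x, y = state
    actions = set()
    if x > 0:
        actions.add("left")
    if x < GRID_SIZE - 1:
        actions.add("right")
    if y > 0:
        actions.add("down")
    if y < GRID_SIZE - 1:
        actions.add("up")
    return actions

print(available_actions((0, 0)))  # corner state: only 'right' and 'up'
```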
Transition
A transition describes how the environment's state changes in response to the agent's action. The transition function p specifies the probability of moving from one state to another, given a specific action.
Environments can be deterministic or stochastic: in a deterministic environment the next state is fully determined by the current state and action, while in a stochastic environment the transition involves some degree of randomness.
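In code, a stochastic transition function can be written as returning a probability distribution over next states rather than a single state. The "slippery" movement below is invented for illustration and reuses the hypothetical 3x3 grid from the previous sketch:

```python
def move(state, action):
    """Deterministic helper: apply the action inside the 3x3 grid."""
    x, y = state
    dx, dy = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
        return (nx, ny)
    return state  # blocked by the grid boundary

def transition(state, action):
    """Stochastic p(s' | s, a): the intended move succeeds with
    probability 0.8, otherwise the agent slips and stays in place."""
    intended = move(state, action)
    if intended == state:
        return {state: 1.0}
    return {intended: 0.8, state: 0.2}
```

A deterministic environment is simply the special case where the returned distribution puts probability 1 on a single next state.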
Reward
A reward r is a numerical value received by the agent after taking an action in a particular state. The function that maps transitions to expected rewards is called the reward function R.
Rewards steer the agent toward desirable behavior and can be either positive or negative. Designing a reward function is challenging: if it is poorly specified, the agent may find ways to exploit it, maximizing reward without achieving the intended behavior.
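A reward function can be as simple as a mapping from a transition to a number. The goal cell and step penalty below are placeholder values for the hypothetical grid world used above:

```python
GOAL = (2, 2)  # hypothetical goal cell in the 3x3 grid

def reward(state, action, next_state):
    """Illustrative R(s, a, s'): +1 for reaching the goal, small step cost."""
    if next_state == GOAL:
        return 1.0
    return -0.01  # small penalty per step discourages wandering
```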
Markov Property
The Markov property in a Markov decision process states that the next state and reward depend only on the current state and action, not on past information. This ensures a memoryless framework, simplifying the learning process.
Mathematically, this property can be described by this formula:
$$P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t) = P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)$$
where:
- $S_t$ is the state at time $t$;
- $A_t$ is the action taken at time $t$;
- $R_t$ is the reward at time $t$.
The memoryless nature of an MDP doesn't mean past observations are ignored; rather, the current state should encode all relevant historical information.
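A practical consequence is that a single environment step only needs the current state and the chosen action, never the history. Below is a sketch of such a step function, assuming the hypothetical transition and reward functions defined earlier:

```python
import random

def step(state, action):
    """Sample one environment step using only (state, action).
    No history is consulted -- this is exactly the Markov assumption."""
    dist = transition(state, action)          # dict: next state -> probability
    next_states = list(dist.keys())
    probs = list(dist.values())
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, reward(state, action, next_state)
```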