Markov Decision Process
A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in which an agent interacts with an environment over time.
Reinforcement learning problems are often framed as MDPs, which provide a structured way to define the problem. MDPs describe the environment using four key components: states, actions, transitions, and rewards. These components work together under the Markov property, which ensures that the future state depends only on the current state and action, not on past states.
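Before looking at each component in detail, here is a minimal sketch of how an MDP can be bundled together in code. This is not a standard API, just one plausible way to structure it; the grid-cell state type and string actions are assumptions used for illustration.

```python
from typing import Callable, Dict, NamedTuple, Set, Tuple

# Hypothetical choices: a state is a cell in a grid, an action is a string.
State = Tuple[int, int]
Action = str

class MDP(NamedTuple):
    """One possible way to bundle the four components of an MDP."""
    states: Set[State]                                          # state space S
    actions: Set[Action]                                        # action space A
    transition: Callable[[State, Action], Dict[State, float]]  # p(s' | s, a)
    reward: Callable[[State, Action, State], float]            # R(s, a, s')
```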
The Four Components
State
A state s is a representation of the environment at a specific point in time. The set of all possible states is called the state space S.
A state is typically represented by a set of parameters that capture the relevant features of the environment, such as position, velocity, or rotation.
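For example, a state might be encoded as a small record of physical quantities. The fields below are one plausible choice for a cart-pole-style environment, not a fixed convention:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CartPoleState:
    """Hypothetical state encoding for a cart-pole-style environment."""
    position: float          # cart position along the track
    velocity: float          # cart velocity
    angle: float             # pole angle from vertical
    angular_velocity: float  # rate of change of the pole angle

s = CartPoleState(position=0.0, velocity=0.1, angle=0.02, angular_velocity=-0.05)
```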
Action
An action a is a decision or a move made by the agent to influence the environment. The set of all possible actions is called the action space A.
The set of possible actions usually depends on the current state.
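One simple way to model this dependence is a function that maps a state to the actions available in it. The grid-navigation rules below are a hypothetical example (the grid size and action names are assumptions, not part of any standard):

```python
GRID_SIZE = 3  # hypothetical 3x3 grid world

def available_actions(state):
    """Return the moves that keep the agent inside the grid."""
    x, y = state
    actions = set()
    if x > 0:
        actions.add("left")
    if x < GRID_SIZE - 1:
        actions.add("right")
    if y > 0:
        actions.add("down")
    if y < GRID_SIZE - 1:
        actions.add("up")
    return actions

print(available_actions((0, 0)))  # corner state: only 'right' and 'up'
```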
Transition
A transition describes how the environment's state changes in response to the agent's action. The transition function p specifies the probability of moving from one state to another, given a specific action.
Environments can be deterministic or stochastic: in a deterministic environment the next state is fully determined by the current state and action, while in a stochastic environment the transition involves some degree of randomness.
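In code, a stochastic transition function can be written as returning a probability distribution over next states rather than a single state. The "slippery" movement below is invented for illustration and reuses the hypothetical 3x3 grid from the previous sketch:

```python
def move(state, action):
    """Deterministic helper: apply the action inside the 3x3 grid."""
    x, y = state
    dx, dy = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
        return (nx, ny)
    return state  # blocked by the grid boundary

def transition(state, action):
    """Stochastic p(s' | s, a): the intended move succeeds with
    probability 0.8, otherwise the agent slips and stays in place."""
    intended = move(state, action)
    if intended == state:
        return {state: 1.0}
    return {intended: 0.8, state: 0.2}
```

A deterministic environment is simply the special case where the returned distribution puts probability 1 on a single next state.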
Reward
A reward r is a numerical value received by the agent after taking an action in a particular state. The function that maps transitions to expected rewards is called the reward function R.
Rewards steer the agent toward desirable behavior and can be either positive or negative. Designing a reward function is challenging: if it is poorly specified, the agent may find ways to exploit it, maximizing reward without achieving the intended behavior.
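A reward function can be as simple as a mapping from a transition to a number. The goal cell and step penalty below are placeholder values for the hypothetical grid world used above:

```python
GOAL = (2, 2)  # hypothetical goal cell in the 3x3 grid

def reward(state, action, next_state):
    """Illustrative R(s, a, s'): +1 for reaching the goal, small step cost."""
    if next_state == GOAL:
        return 1.0
    return -0.01  # small penalty per step discourages wandering
```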
Markov Property
The Markov property in a Markov decision process states that the next state and reward depend only on the current state and action, not on past information. This ensures a memoryless framework, simplifying the learning process.
Mathematically, this property can be described by this formula:
$$P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t) = P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)$$
where:
- $S_t$ is the state at time $t$;
- $A_t$ is the action taken at time $t$;
- $R_t$ is the reward at time $t$.
The memoryless nature of an MDP doesn't mean past observations are ignored; rather, the current state should encode all relevant historical information.
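A practical consequence is that a single environment step only needs the current state and the chosen action, never the history. Below is a sketch of such a step function, assuming the hypothetical transition and reward functions defined earlier:

```python
import random

def step(state, action):
    """Sample one environment step using only (state, action).
    No history is consulted -- this is exactly the Markov assumption."""
    dist = transition(state, action)          # dict: next state -> probability
    next_states = list(dist.keys())
    probs = list(dist.values())
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, reward(state, action, next_state)
```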