Markov Decision Processes: The Mathematical Framework
To understand how reinforcement learning agents make decisions, you need a clear framework that defines the environment, the agent's choices, and the consequences of those choices. This is where Markov Decision Processes (MDPs) come in.
An MDP is made up of:
- States: the different situations or positions the agent can be in (each spot in the maze);
- Actions: the choices available to the agent at each state (which direction to move);
- Rewards: the feedback the agent gets after taking an action (like +1 for the exit, -1 for a trap);
- Transitions: the rules or probabilities that determine what next state results from each action (moving right from one cell leads to the next cell to the right).
Using the maze analogy, you can think of the agent as someone trying to find the best way out, learning which moves lead to the goal and which lead to dead ends.
```python
# Define a simple maze as an MDP using Python dictionaries
states = ["A", "B", "C", "D"]  # Maze positions

# Actions available in each state
actions = {
    "A": ["right", "down"],
    "B": ["left", "down"],
    "C": ["up", "right"],
    "D": ["up", "left"]
}

# Rewards for taking an action from a state
rewards = {
    ("A", "right"): 0,
    ("A", "down"): -1,
    ("B", "left"): 0,
    ("B", "down"): 1,  # Reaching the goal
    ("C", "up"): 0,
    ("C", "right"): -1,
    ("D", "up"): 0,
    ("D", "left"): -1
}

# Transitions: (current_state, action) -> next_state
transitions = {
    ("A", "right"): "B",
    ("A", "down"): "C",
    ("B", "left"): "A",
    ("B", "down"): "D",
    ("C", "up"): "A",
    ("C", "right"): "D",
    ("D", "up"): "B",
    ("D", "left"): "C"
}

# Example: What happens if the agent is in state "A" and takes action "right"?
current_state = "A"
action = "right"
next_state = transitions[(current_state, action)]
reward = rewards[(current_state, action)]
print(f"From {current_state}, take '{action}' -> {next_state}, reward: {reward}")
```
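Building on those dictionaries, a short episode can be simulated by repeatedly looking up the transition and reward for each chosen action. The sketch below (a hypothetical extension of the maze example, with the dictionaries repeated so it runs on its own) follows the two-step path from `"A"` to the goal at `"D"` and accumulates the reward along the way:

```python
# Hypothetical sketch: follow a fixed sequence of actions through the maze
# MDP defined above and accumulate the total reward.
transitions = {
    ("A", "right"): "B", ("A", "down"): "C",
    ("B", "left"): "A", ("B", "down"): "D",
    ("C", "up"): "A", ("C", "right"): "D",
    ("D", "up"): "B", ("D", "left"): "C",
}
rewards = {
    ("A", "right"): 0, ("A", "down"): -1,
    ("B", "left"): 0, ("B", "down"): 1,  # Reaching the goal
    ("C", "up"): 0, ("C", "right"): -1,
    ("D", "up"): 0, ("D", "left"): -1,
}

state = "A"
total_reward = 0
for action in ["right", "down"]:  # A -> B -> D (the goal)
    next_state = transitions[(state, action)]
    total_reward += rewards[(state, action)]
    print(f"{state} --{action}--> {next_state}")
    state = next_state

print(f"Final state: {state}, total reward: {total_reward}")  # D, reward 1
```

A reinforcement learning agent would not be handed this action sequence; it would discover a good one by trying actions and observing the rewards that come back.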
The key feature that makes an MDP special is the Markov property. In simple terms, this means that the agent's future is determined only by its current state and the action it chooses, not by the sequence of states and actions that came before. In the maze, if you know your current position, you have all the information you need to decide your next move and predict the consequences. You do not need to remember the entire path you took to get there.
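The Markov property can be made concrete with the maze dictionaries: two agents that arrive at the same cell by completely different routes face exactly the same future. In this hypothetical sketch (the transition table is repeated so the snippet is self-contained), both agents end up in `"D"` and get identical outcomes for the same action, regardless of their histories:

```python
# Hypothetical sketch: the Markov property in the maze MDP.
# Transitions depend only on (current_state, action), never on history.
transitions = {
    ("A", "right"): "B", ("A", "down"): "C",
    ("B", "left"): "A", ("B", "down"): "D",
    ("C", "up"): "A", ("C", "right"): "D",
    ("D", "up"): "B", ("D", "left"): "C",
}

path_1 = ["A", "B", "D"]  # reached D via B
path_2 = ["A", "C", "D"]  # reached D via C

# Both agents are now in "D"; predicting the result of "up" needs
# only the current state, so the two histories are irrelevant.
next_1 = transitions[(path_1[-1], "up")]
next_2 = transitions[(path_2[-1], "up")]
print(next_1, next_2, next_1 == next_2)  # B B True
```

This is why the transition table can be keyed on `(state, action)` pairs alone: if the future also depended on the path taken, every distinct history would need its own entry.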