Policies: Strategies for Decision-Making
A policy in reinforcement learning is the agent’s strategy for making decisions. Formally, a policy is a mapping from situations (also called "states") to actions. You can think of a policy as the agent’s internal rulebook: whenever the agent finds itself in a particular state, the policy tells it what action to take next.
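To make the idea of a mapping concrete, here is a minimal sketch of a deterministic policy written as a plain lookup table from states to actions. The state and action names are invented for illustration and do not come from any particular environment.

# A policy as a lookup table: each state maps to exactly one action.
# State and action names here are illustrative only.
policy = {
    "at_start": "move_forward",
    "corridor": "move_forward",
    "junction": "turn_left",
    "dead_end": "turn_around",
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("junction"))  # turn_left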
Imagine you are controlling a robot in a maze. If you use a random policy, the robot chooses its next move without any plan - sometimes going left, sometimes right, sometimes forward, with no consideration of where it is or where it’s been. This often leads to wandering in circles or getting stuck at dead ends. In contrast, a learned policy is developed through experience. Over time, the agent notices which actions help it reach the goal more quickly and starts to favor those choices. The result is a more direct, efficient path through the maze, as the agent’s decisions become guided by what it has learned works best in each situation.
# Simple policy pseudocode for maze navigation
def simple_policy(state):
    """
    Given the current state, decide the next action.
    If at a dead end, turn around. Otherwise, move forward.
    """
    if state == "dead_end":
        action = "turn_around"
    else:
        action = "move_forward"
    return action

# Example usage:
state = "dead_end"
print(simple_policy(state))

state = "corridor"
print(simple_policy(state))
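For contrast with the rule-based policy above, a purely random policy like the one described earlier ignores the state entirely and simply picks any available move. This is a small sketch with made-up action names, not part of the lesson's original code.

import random

def random_policy(state):
    """Pick an action uniformly at random, ignoring the state entirely."""
    actions = ["move_forward", "turn_left", "turn_right", "turn_around"]
    return random.choice(actions)

print(random_policy("corridor"))  # could be any of the four actions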
Policies are not fixed - they can be improved over time as the agent interacts with its environment and receives feedback. At first, an agent may act randomly, but as it gathers experience, it updates its policy to favor actions that lead to higher rewards. This process is at the heart of reinforcement learning: the agent continually refines its policy, learning from both successes and mistakes, to make better decisions in the future.
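The sketch below illustrates this refinement loop in miniature, assuming a small table of action-value estimates that gets nudged toward whatever reward the environment returns, with occasional random exploration. The states, actions, and reward signal are invented for illustration; real algorithms differ in the details.

import random

# Hypothetical states and actions for a toy maze; not from a real environment.
states = ["corridor", "junction", "dead_end"]
actions = ["move_forward", "turn_left", "turn_around"]

# Estimated value of taking each action in each state, all zero at first.
values = {(s, a): 0.0 for s in states for a in actions}

def choose_action(state, epsilon=0.2):
    """Mostly pick the best-known action, sometimes explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: values[(state, a)])

def update(state, action, reward, learning_rate=0.1):
    """Nudge the value estimate toward the reward that was observed."""
    values[(state, action)] += learning_rate * (reward - values[(state, action)])

# One simulated step: the agent acts, then learns from (made-up) feedback.
state = "junction"
action = choose_action(state)
reward = 1.0 if action == "turn_left" else -0.1   # fake reward signal for the sketch
update(state, action, reward)

Repeating this act-observe-update loop many times is what gradually shifts the policy from random behavior toward choices that have earned higher rewards.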