Lesson 9.1: Markov Decision Process (MDP)
The Markov Decision Process (MDP) is a fundamental mathematical framework for modeling sequential decision-making in reinforcement learning (RL). It provides a structured way to represent environments where an agent interacts with a system by taking actions, receiving rewards, and transitioning between states.
Key Components of an MDP
- State Space (S): Set of all possible states the agent can be in (e.g., positions in a grid world).
- Action Space (A): Set of all possible actions the agent can take (e.g., move left, right, up, down).
- Transition Function P(s' | s, a): Probability of moving to state s' given the current state s and action a.
- Reward Function R(s, a, s'): Immediate reward received after transitioning from s to s' via action a.
- Discount Factor γ: Determines how much future rewards are valued (0 ≤ γ ≤ 1).
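To make these components concrete, here is a minimal sketch of an MDP encoded as plain Python dictionaries. The two-state, two-action example, its transition probabilities, and its rewards are invented purely for illustration; a real problem defines its own.

```python
# A hypothetical two-state, two-action MDP (all values are illustrative).
states = ["s0", "s1"]
actions = ["left", "right"]

# Transition function: P[s][a] maps each next state s' to P(s' | s, a).
# Probabilities for each (s, a) pair sum to 1.
P = {
    "s0": {"left":  {"s0": 0.9, "s1": 0.1},
           "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left":  {"s0": 0.7, "s1": 0.3},
           "right": {"s0": 0.1, "s1": 0.9}},
}

# Reward function R(s, a, s'); transitions not listed pay 0 (an assumption).
R = {("s0", "right", "s1"): 1.0}

gamma = 0.95  # discount factor (0 <= gamma <= 1)
```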
How an MDP Works in RL
- At each time step t:
  - The agent observes the current state s_t.
  - It selects an action a_t (using a policy π(a | s)).
  - The environment transitions to a new state s_{t+1} with probability P(s_{t+1} | s_t, a_t).
  - The agent receives a reward r_t = R(s_t, a_t, s_{t+1}).
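This interaction loop can be sketched in a few lines, continuing the hypothetical two-state MDP from the sketch above; the uniform random policy and the 10-step horizon are arbitrary choices made for illustration.

```python
import random

def step(s, a):
    """Sample the next state from P(s' | s, a) and return (s', r)."""
    next_states = list(P[s][a])
    probs = [P[s][a][s2] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs)[0]
    r = R.get((s, a, s_next), 0.0)  # unlisted transitions pay 0 (assumption)
    return s_next, r

s = "s0"
rewards = []
for t in range(10):              # 10 time steps, chosen arbitrarily
    a = random.choice(actions)   # a uniform random policy pi(a | s)
    s, r = step(s, a)
    rewards.append(r)
```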
- Goal: Maximize the expected cumulative (discounted) reward, i.e., the return
  G_t = E[ Σ_{k=0}^∞ γ^k · r_{t+k} ].
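Using the `rewards` list and `gamma` from the sketches above, the return at t = 0 follows directly from the formula:

```python
# G_0 = sum over k of gamma^k * r_k, applied to the collected rewards
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"Discounted return G_0 = {G:.3f}")
```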