Lesson 9.1: Markov Decision Process (MDP)


The Markov Decision Process (MDP) is a fundamental mathematical framework for modeling sequential decision-making in reinforcement learning (RL). It provides a structured way to represent environments where an agent interacts with the environment by taking actions, receiving rewards, and transitioning between states.

Key Components of an MDP

  • State Space (S): Set of all possible states the agent can be in (e.g., positions in a grid world).
  • Action Space (A): Set of all possible actions the agent can take (e.g., move left, right, up, down).
  • Transition Function (P): Probability P(s' | s, a) of moving to state s' given current state s and action a.
  • Reward Function (R): Immediate reward R(s, a, s') received after transitioning from s to s' via action a.
  • Discount Factor (γ): Determines how much future rewards are valued (0 ≤ γ ≤ 1). A minimal code sketch of all five components follows this list.
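To make the five components concrete, here is a minimal sketch in Python of a hypothetical two-state toy MDP. The state names, action names, transition probabilities, and reward values are illustrative assumptions, not part of the lesson:

```python
S = ["s0", "s1"]      # state space
A = ["stay", "move"]  # action space
gamma = 0.9           # discount factor, 0 <= gamma <= 1

# Transition function P(s' | s, a) as nested dicts:
# P[s][a] maps each possible next state s' to its probability.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# Reward function R(s, a, s'): immediate reward for the transition.
def R(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0  # assumed reward: reaching s1 pays 1
```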

How an MDP Works in RL

  1. At each time step t:
    • The agent observes the current state s_t.
    • It selects an action a_t (using a policy π(a | s)).
    • The environment transitions to a new state s_{t+1} with probability P(s_{t+1} | s_t, a_t).
    • The agent receives a reward r_t = R(s_t, a_t, s_{t+1}).
  2. Goal:
    • Maximize the expected cumulative (discounted) reward:

      G_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ],  where 0 ≤ γ ≤ 1.
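The loop above maps directly onto a rollout. Below is a short sketch, reusing the toy MDP defined in the earlier sketch, that runs the agent-environment loop for a fixed horizon and accumulates the discounted return; the uniform random policy and the 20-step horizon are illustrative choices, not from the lesson:

```python
import random

def sample_next_state(s, a):
    """Sample s' according to P(s' | s, a)."""
    next_states = list(P[s][a])
    probs = [P[s][a][sp] for sp in next_states]
    return random.choices(next_states, weights=probs)[0]

def run_episode(policy, s0="s0", horizon=20):
    """Run the agent-environment loop and return the discounted return G."""
    s, G = s0, 0.0
    for t in range(horizon):
        a = policy(s)                     # agent selects a_t using pi(a | s_t)
        s_next = sample_next_state(s, a)  # environment transitions to s_{t+1}
        r = R(s, a, s_next)               # reward r_t = R(s_t, a_t, s_{t+1})
        G += (gamma ** t) * r             # accumulate gamma^t * r_t
        s = s_next
    return G

# Example: a uniformly random policy over the action space A.
print(run_episode(lambda s: random.choice(A)))
```

With a long enough horizon, averaging the returned G over many episodes approximates the expected discounted return that the agent seeks to maximize.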
