Lesson 9.1: Markov Decision Process (MDP)
The Markov Decision Process (MDP) is a fundamental mathematical framework for modeling sequential decision-making in reinforcement learning (RL). It provides a structured way to represent environments where an agent interacts with a system by taking actions, receiving rewards, and transitioning between states.
Key Components of an MDP
- State Space (S): Set of all possible states the agent can be in (e.g., positions in a grid world).
- Action Space (A): Set of all possible actions the agent can take (e.g., move left, right, up, down).
- Transition Function P(s' | s, a): Probability of moving to state s' given current state s and action a.
- Reward Function R(s, a, s'): Immediate reward received after transitioning from s to s' via action a.
- Discount Factor γ: Determines how much future rewards are valued (0 ≤ γ ≤ 1). A minimal code sketch of these components follows this list.
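To make these components concrete, here is a minimal Python sketch of a toy 1-D grid-world MDP. The grid size, transition probabilities, reward, and discount value are illustrative assumptions, not part of the lesson.

```python
# Hypothetical 4-cell 1-D grid world (all numbers are assumptions).
STATES = [0, 1, 2, 3]          # state space S
ACTIONS = ["left", "right"]    # action space A
GAMMA = 0.9                    # discount factor γ (assumed value)

def transition_probs(s, a):
    """Transition function: returns {s': P(s' | s, a)}.
    The move succeeds with probability 0.8; otherwise the agent stays put."""
    intended = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return {intended: 0.8, s: 0.2} if intended != s else {s: 1.0}

def reward(s, a, s_next):
    """Reward function R(s, a, s'): +1 for reaching the rightmost cell."""
    return 1.0 if s_next == 3 else 0.0
```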
How an MDP Works in RL
- At each time step t (see the sketch after this list):
  - The agent observes the current state s_t.
  - It selects an action a_t using a policy π(a | s_t).
  - The environment transitions to a new state s_{t+1} with probability P(s_{t+1} | s_t, a_t).
  - The agent receives a reward r_t = R(s_t, a_t, s_{t+1}).
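The loop above can be sketched in a few lines of Python, reusing the toy grid-world definitions from earlier. The uniform-random policy, start state, and episode length are illustrative assumptions.

```python
import random

def random_policy(s):
    """A placeholder uniform-random policy π(a | s)."""
    return random.choice(ACTIONS)

def run_episode(policy, max_steps=20):
    """Roll out one episode of the agent-environment interaction loop."""
    s = 0                                    # assumed start state
    trajectory = []
    for t in range(max_steps):
        a = policy(s)                        # a_t ~ π(a | s_t)
        probs = transition_probs(s, a)       # P(s_{t+1} | s_t, a_t)
        s_next = random.choices(list(probs), weights=list(probs.values()))[0]
        r = reward(s, a, s_next)             # r_t = R(s_t, a_t, s_{t+1})
        trajectory.append((s, a, r))
        s = s_next
        if s == 3:                           # stop at the rewarding cell
            break
    return trajectory
```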
- Goal: Maximize the expected cumulative (discounted) reward, i.e., the expected return

  G_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ]
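Given a trajectory produced by the loop above, the discounted return can be computed by applying this formula directly. The sketch below scores one episode collected by the hypothetical run_episode helper.

```python
def discounted_return(rewards, gamma=GAMMA):
    """G_0 = sum over k of gamma^k * r_k, the discounted return from t = 0."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example usage: run one episode under the random policy and score it.
episode = run_episode(random_policy)
G = discounted_return([r for (_, _, r) in episode])
print(f"Return G_0 = {G:.3f}")
```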