Lesson 9.1: Markov Decision Process (MDP)
The Markov Decision Process (MDP) is a fundamental mathematical framework for modeling sequential decision-making in reinforcement learning (RL). It provides a structured way to represent environments where an agent interacts with a system by taking actions, receiving rewards, and transitioning between states.
Key Components of an MDP
- State Space (S): Set of all possible states the agent can be in (e.g., positions in a grid world).
- Action Space (A): Set of all possible actions the agent can take (e.g., move left, right, up, down).
- Transition Function P(s' | s, a): Probability of moving to state s' given the current state s and action a.
- Reward Function R(s, a, s'): Immediate reward received after transitioning from s to s' via action a.
- Discount Factor γ: Determines how much future rewards are valued (0 ≤ γ ≤ 1).
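To make these components concrete, here is a minimal sketch of an MDP encoded as plain Python dictionaries. The two-state, two-action example, its transition probabilities, and its rewards are invented purely for illustration; a real problem defines its own.

```python
# A hypothetical two-state, two-action MDP (all values are illustrative).
states = ["s0", "s1"]
actions = ["left", "right"]

# Transition function: P[s][a] maps each next state s' to P(s' | s, a).
# Probabilities for each (s, a) pair sum to 1.
P = {
    "s0": {"left":  {"s0": 0.9, "s1": 0.1},
           "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left":  {"s0": 0.7, "s1": 0.3},
           "right": {"s0": 0.1, "s1": 0.9}},
}

# Reward function R(s, a, s'); transitions not listed pay 0 (an assumption).
R = {("s0", "right", "s1"): 1.0}

gamma = 0.95  # discount factor (0 <= gamma <= 1)
```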
How an MDP Works in RL
- At each time step t:
  - The agent observes the current state s_t.
  - It selects an action a_t (using a policy π(a | s)).
  - The environment transitions to a new state s_{t+1} with probability P(s_{t+1} | s_t, a_t).
  - The agent receives a reward r_t = R(s_t, a_t, s_{t+1}).
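This interaction loop can be sketched in a few lines, continuing the hypothetical two-state MDP from the sketch above; the uniform random policy and the 10-step horizon are arbitrary choices made for illustration.

```python
import random

def step(s, a):
    """Sample the next state from P(s' | s, a) and return (s', r)."""
    next_states = list(P[s][a])
    probs = [P[s][a][s2] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs)[0]
    r = R.get((s, a, s_next), 0.0)  # unlisted transitions pay 0 (assumption)
    return s_next, r

s = "s0"
rewards = []
for t in range(10):              # 10 time steps, chosen arbitrarily
    a = random.choice(actions)   # a uniform random policy pi(a | s)
    s, r = step(s, a)
    rewards.append(r)
```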
- Goal: Maximize the expected cumulative (discounted) reward, i.e., the return
  G_t = E[ Σ_{k=0}^∞ γ^k · r_{t+k} ].
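Using the `rewards` list and `gamma` from the sketches above, the return at t = 0 follows directly from the formula:

```python
# G_0 = sum over k of gamma^k * r_k, applied to the collected rewards
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"Discounted return G_0 = {G:.3f}")
```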