Lesson 3.4: Gated Recurrent Units (GRU)


GRUs are a streamlined variant of LSTMs designed to capture long-range dependencies while being computationally lighter. They achieve this through two gating mechanisms (no separate cell state) that regulate information flow.

Fig: GRU architecture

  1. Update Gate ($z_t$)
  • Role: Decides how much of the past hidden state to retain vs. how much to update with new information.
  • Equation: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
    • $\sigma$: Sigmoid (outputs values between 0 and 1).
  • Interpretation:
    • $z_t \approx 0$: Keep most of the past state (like LSTM’s forget gate).
    • $z_t \approx 1$: Update with new information (like LSTM’s input gate).
  2. Reset Gate ($r_t$)
  • Role: Determines how much of the past hidden state to ignore when computing the new candidate state.
  • Equation: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
  • Interpretation:
    • $r_t \approx 0$: "Reset" (ignore the past state, focus only on the current input).
    • $r_t \approx 1$: Integrate past and current information.
  3. Candidate Hidden State ($\tilde{h}_t$)
  • Role: Proposed new state based on the reset gate and the current input.
  • Equation: $\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)$
  • Key:
    • $\odot$: Element-wise multiplication.
    • If $r_t \approx 0$, the candidate ignores $h_{t-1}$ (it is computed from $x_t$ alone).
  4. Final Hidden State ($h_t$)
  • Role: Blends the past state $h_{t-1}$ and the candidate $\tilde{h}_t$ using the update gate (all four equations are put together in the code sketch after this list).
  • Equation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
  • Interpretation:
    • $z_t \approx 1$: Replace $h_{t-1}$ with $\tilde{h}_t$.
    • $z_t \approx 0$: Keep $h_{t-1}$ unchanged.
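
To make the four equations concrete, here is a minimal sketch of a single GRU step in plain NumPy. The function and variable names (`gru_step`, `W_z`, `W_r`, `W_h`, the toy sizes) are illustrative assumptions chosen to mirror the notation above, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step, following the equations above.

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_dim,)
    W_*:    weights applied to [h_{t-1}, x_t], shape (hidden_dim, hidden_dim + input_dim)
    b_*:    biases, shape (hidden_dim,)
    """
    concat = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat + b_z)                    # update gate
    r_t = sigmoid(W_r @ concat + b_r)                    # reset gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ concat_reset + b_h)          # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde             # blend past state and candidate
    return h_t

# Toy usage with made-up sizes
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)
W_z, W_r, W_h = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)
print(gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h))
```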

Why GRUs Solve Vanishing Gradients

  • Additive Updates: Like LSTMs, GRUs blend old and new information with a gated sum, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, rather than forcing every step through a fresh matrix multiplication and nonlinearity, which preserves gradients over long spans (see the sketch after this list).
  • Gradient Paths:
    • The reset gate $r_t$ can learn to ignore irrelevant history.
    • The update gate $z_t$ can learn to preserve critical long-term information.
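
As a rough sketch of why the additive update helps (tracking only the direct path through $h_{t-1}$ in the update equation, and ignoring the indirect dependence of $z_t$ and $\tilde{h}_t$ on $h_{t-1}$):

```latex
% h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}(1 - z_t)
  + \underbrace{\left(\text{terms through } z_t \text{ and } \tilde{h}_t\right)}_{\text{indirect paths}}
```

When the update gate stays near $z_t \approx 0$, the direct term is close to the identity, so gradients can flow across many time steps without being repeatedly squashed; the classic vanishing-gradient problem affects the indirect paths, not this additive one.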

GRU vs. LSTM

| Feature | GRU | LSTM |
| --- | --- | --- |
| Gates | 2 (update, reset) | 3 (forget, input, output) |
| Cell State | None (hidden state only) | Explicit cell state ($C_t$) |
| Complexity | Fewer parameters (faster) | More expressive (slower) |
| Use Cases | Short to medium sequences | Very long sequences |
  • GRUs simplify LSTMs by combining forget/input gates into an update gate and removing the cell state.
  • Reset Gate: Filters irrelevant past info.
  • Update Gate: Balances old/new memory.
  • Advantage: Faster to train than LSTMs while often achieving comparable performance (a quick parameter-count check follows below).
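
The complexity row of the table is easy to check empirically. Below is a small sketch using PyTorch's built-in nn.GRU and nn.LSTM (the layer sizes are arbitrary): with the same input and hidden dimensions, the GRU has roughly 3/4 of the LSTM's parameters, since it has three weight blocks per layer instead of four.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Arbitrary example sizes
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

print("GRU  parameters:", count_params(gru))   # 3 gate/candidate blocks
print("LSTM parameters:", count_params(lstm))  # 4 gate/candidate blocks
```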