Lesson 3.4: Gated Recurrent Units (GRU)


GRUs are a streamlined variant of LSTMs designed to capture long-range dependencies while being computationally lighter. They achieve this through two gating mechanisms (no separate cell state) that regulate information flow.

Fig: GRU architecture

  1. Update Gate ($z_t$)
  • Role: Decides how much of the past hidden state to retain vs. how much to update with new information.
  • Equation: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
    • $\sigma$: Sigmoid (outputs values between 0 and 1).
  • Interpretation:
    • $z_t \approx 0$: Keep most of the past state (like LSTM’s forget gate).
    • $z_t \approx 1$: Update with new information (like LSTM’s input gate).
  2. Reset Gate ($r_t$)
  • Role: Determines how much of the past hidden state to ignore when computing the new candidate state.
  • Equation: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
  • Interpretation:
    • $r_t \approx 0$: "Reset" (ignore the past state, focus only on the current input).
    • $r_t \approx 1$: Integrate past and current information.
  3. Candidate Hidden State ($\tilde{h}_t$)
  • Role: Proposed new state based on the reset gate and the current input.
  • Equation: $\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)$
  • Key:
    • $\odot$: Element-wise multiplication.
    • If $r_t \approx 0$, the candidate ignores $h_{t-1}$ (it is computed from $x_t$ alone).
  4. Final Hidden State ($h_t$)
  • Role: Blends the past state $h_{t-1}$ and the candidate $\tilde{h}_t$ using the update gate (all four equations are put together in the code sketch after this list).
  • Equation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
  • Interpretation:
    • $z_t \approx 1$: Replace $h_{t-1}$ with $\tilde{h}_t$.
    • $z_t \approx 0$: Keep $h_{t-1}$ unchanged.
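
To make the four equations concrete, here is a minimal sketch of a single GRU step in plain NumPy. The function and variable names (`gru_step`, `W_z`, `W_r`, `W_h`, the toy sizes) are illustrative assumptions chosen to mirror the notation above, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step, following the equations above.

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_dim,)
    W_*:    weights applied to [h_{t-1}, x_t], shape (hidden_dim, hidden_dim + input_dim)
    b_*:    biases, shape (hidden_dim,)
    """
    concat = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat + b_z)                    # update gate
    r_t = sigmoid(W_r @ concat + b_r)                    # reset gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ concat_reset + b_h)          # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde             # blend past state and candidate
    return h_t

# Toy usage with made-up sizes
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)
W_z, W_r, W_h = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)
print(gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h))
```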

Why GRUs Solve Vanishing Gradients

  • Additive Updates: Like LSTMs, GRUs blend old and new information with a gated sum, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, rather than forcing every step through a fresh matrix multiplication and nonlinearity, which preserves gradients over long spans (see the sketch after this list).
  • Gradient Paths:
    • The reset gate $r_t$ can learn to ignore irrelevant history.
    • The update gate $z_t$ can learn to preserve critical long-term information.
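
As a rough sketch of why the additive update helps (tracking only the direct path through $h_{t-1}$ in the update equation, and ignoring the indirect dependence of $z_t$ and $\tilde{h}_t$ on $h_{t-1}$):

```latex
% h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}(1 - z_t)
  + \underbrace{\left(\text{terms through } z_t \text{ and } \tilde{h}_t\right)}_{\text{indirect paths}}
```

When the update gate stays near $z_t \approx 0$, the direct term is close to the identity, so gradients can flow across many time steps without being repeatedly squashed; the classic vanishing-gradient problem affects the indirect paths, not this additive one.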

GRU vs. LSTM

| Feature | GRU | LSTM |
| --- | --- | --- |
| Gates | 2 (update, reset) | 3 (forget, input, output) |
| Cell State | None (hidden state only) | Explicit cell state ($C_t$) |
| Complexity | Fewer parameters (faster) | More expressive (slower) |
| Use Cases | Short to medium sequences | Very long sequences |
  • GRUs simplify LSTMs by combining forget/input gates into an update gate and removing the cell state.
  • Reset Gate: Filters irrelevant past info.
  • Update Gate: Balances old/new memory.
  • Advantage: Faster to train than LSTMs while often achieving comparable performance (a quick parameter-count check follows below).
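
The complexity row of the table is easy to check empirically. Below is a small sketch using PyTorch's built-in nn.GRU and nn.LSTM (the layer sizes are arbitrary): with the same input and hidden dimensions, the GRU has roughly 3/4 of the LSTM's parameters, since it has three weight blocks per layer instead of four.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Arbitrary example sizes
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

print("GRU  parameters:", count_params(gru))   # 3 gate/candidate blocks
print("LSTM parameters:", count_params(lstm))  # 4 gate/candidate blocks
```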