Lesson 7.1: Fusion and Alignment


Multimodal learning aims to integrate and align data from different modalities (e.g., text, images, audio, video) to improve model performance. The two core challenges are:

  • Fusion – How to combine information from multiple modalities effectively.
  • Alignment – How to establish meaningful relationships between modalities.

Fusion Methods

Fusion strategies determine when and how modalities are combined. The three main approaches are:

  • A. Early Fusion (Feature-Level Fusion)

    • Process: Raw inputs or low-level features from all modalities are concatenated before being fed into a single model.
      • Example: Concatenating image pixel features with text embeddings into one input vector for a neural network (see the early-fusion sketch after this list).
    • Pros:
      • Captures low-level interactions between modalities.
      • Simple training setup (a single model pipeline handles all modalities).
    • Cons:
      • Sensitive to noise/missing data.
      • Loses modality-specific nuances.
    • Use Cases: Simple tasks where modalities are tightly coupled (e.g., video + audio for lip-sync).
  • B. Mid Fusion (Intermediate-Level Fusion)

    • Process: Each modality is first processed by its own encoder; the resulting intermediate features are then fused in later layers.
    • Example:
      • Text → BERT → Intermediate features
      • Image → CNN → Intermediate features
      • Fused via attention or concatenation → Final prediction (see the mid-fusion sketch after this list).
    • Pros:
      • Balances modality-specific and cross-modal learning.
      • Flexible (e.g., cross-modal attention in transformers).
    • Cons:
      • Complex to design (requires careful layer selection).
    • Use Cases: Vision-language tasks (e.g., visual question answering (VQA), image captioning).
  • C. Late Fusion (Decision-Level Fusion)

    • Process: Each modality is processed independently, and final predictions are combined (e.g., averaging, voting).
    • Example:
      • Text model predicts sentiment → Score 1
      • Audio model predicts sentiment → Score 2
      • Combined via a weighted average (see the late-fusion sketch after this list).
    • Pros:
      • Robust to missing modalities.
      • Modular (easy to swap models).
    • Cons:
      • Ignores cross-modal interactions.
    • Use Cases: Ensemble-based systems (e.g., emotion recognition from face + voice).
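The three strategies above can be illustrated with short PyTorch sketches. First, early fusion: features are concatenated before any joint processing. This is a minimal sketch, not a reference implementation; the class name, feature dimensions, and random inputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Early (feature-level) fusion: concatenate modality features, then model them jointly."""
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),  # operates on the fused vector
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_feats, text_feats):
        fused = torch.cat([image_feats, text_feats], dim=-1)  # fuse before any joint layers
        return self.net(fused)

# Toy usage with random stand-ins for pre-extracted image and text features.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```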
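Next, mid fusion: each modality keeps its own encoder, and fusion happens at an intermediate layer, here via cross-modal attention where text tokens attend to image regions. The tiny linear encoders stand in for real backbones (e.g., BERT, a CNN); all names and dimensions are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class MidFusionModel(nn.Module):
    """Mid (intermediate-level) fusion: modality-specific encoders, then cross-modal attention."""
    def __init__(self, d_model=256, num_classes=10):
        super().__init__()
        self.text_encoder = nn.Linear(768, d_model)     # stand-in for a text backbone
        self.image_encoder = nn.Linear(2048, d_model)   # stand-in for an image backbone
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, n_tokens, 768); image_feats: (batch, n_regions, 2048)
        t = self.text_encoder(text_feats)
        v = self.image_encoder(image_feats)
        fused, attn_weights = self.cross_attn(query=t, key=v, value=v)  # text attends to image
        return self.classifier(fused.mean(dim=1)), attn_weights

model = MidFusionModel()
logits, attn = model(torch.randn(4, 12, 768), torch.randn(4, 36, 2048))
print(logits.shape, attn.shape)  # torch.Size([4, 10]) torch.Size([4, 12, 36])
```

The returned attention weights are also what makes attention-based alignment interpretable: each text token's row shows which image regions it attended to.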
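Finally, late fusion: each modality produces its own decision, and only the scores are combined. The fixed 0.6/0.4 weights and the toy unimodal models are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Independent unimodal sentiment models (toy stand-ins for real text/audio models).
text_model = nn.Linear(768, 3)    # 3 sentiment classes
audio_model = nn.Linear(128, 3)

def late_fusion_predict(text_feats, audio_feats=None, w_text=0.6, w_audio=0.4):
    """Decision-level fusion: combine per-modality class probabilities with a weighted average."""
    p_text = torch.softmax(text_model(text_feats), dim=-1)
    if audio_feats is None:       # robust to a missing modality: fall back to text alone
        return p_text
    p_audio = torch.softmax(audio_model(audio_feats), dim=-1)
    return w_text * p_text + w_audio * p_audio

probs = late_fusion_predict(torch.randn(4, 768), torch.randn(4, 128))
print(probs.shape)  # torch.Size([4, 3])
```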

Alignment Techniques

Alignment establishes semantic correspondences between modalities, so that related content (e.g., an image region and the phrase describing it) maps to related representations. Key methods include:

  • A. Attention Mechanisms

    • Process: Dynamically weights features from different modalities based on relevance.
    • Example: In vision-language models, cross-modal attention lets text tokens "attend" to image regions (and vice versa), as in the mid-fusion sketch above.
    • Pros:
      • Interpretable (attention weights show which parts of each modality contribute to a prediction).
      • Handles variable-length inputs.
    • Use Cases: Image captioning, visual question answering, multimodal transformers with cross-modal attention layers.
  • B. Maximum Mean Discrepancy (MMD)

    • Process: A kernel-based statistical distance between feature distributions, minimized as a training loss to bring the modalities' embedding distributions closer.
    • Example: Aligns the latent spaces of text and image embeddings by minimizing an MMD loss (see the MMD sketch after this list).
    • Pros: Effective for unsupervised alignment.
    • Use Cases: Domain adaptation (e.g., aligning medical images + reports).
  • C. Contrastive Learning

    • Process: Pulls paired multimodal data closer in embedding space while pushing unpaired data apart.
    • Example: CLIP (Contrastive Language-Image Pre-training) aligns images + text via a contrastive loss (see the contrastive-loss sketch after this list).
    • Pros: Works well with limited labeled data.
    • Use Cases: Multimodal retrieval (e.g., search images using text queries).
  • D. Graph-Based Alignment

    • Process: Represents elements of each modality (e.g., video frames, speech segments, words) as graph nodes and aligns them with graph neural networks (GNNs) that pass messages along cross-modal edges.
    • Example: Aligning video frames (visual) + speech segments (audio) via a relational graph (see the graph sketch after this list).
    • Pros: Captures complex, non-linear relationships.
    • Use Cases: Video understanding, multimodal knowledge graphs.
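The alignment losses above can also be sketched briefly. First, MMD with a Gaussian (RBF) kernel: minimizing this quantity pulls the two embedding distributions together. The bandwidth, batch size, and embedding dimension are assumptions; practical setups often average over several bandwidths.

```python
import torch

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between two sets of embeddings."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_loss(x, y, bandwidth=1.0):
    """MMD^2 estimate: within-distribution similarity minus cross-distribution similarity."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2 * k_xy

# Toy usage: random stand-ins for text and image encoder outputs.
text_emb = torch.randn(32, 256)
image_emb = torch.randn(32, 256)
print(mmd_loss(text_emb, image_emb))  # scalar loss to minimize during training
```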
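Next, a CLIP-style symmetric contrastive loss. It assumes the i-th image and i-th text in a batch form a matched pair; the temperature value here is a fixed assumption (CLIP learns it as a parameter).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (image, text) pairs are pulled together,
    all other pairs in the batch are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(image_emb.size(0))         # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random paired embeddings.
print(contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)))
```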
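Finally, a very small graph-based alignment sketch in plain PyTorch (no GNN library): frame and speech-segment embeddings become nodes, cross-modal edges connect temporally corresponding pairs, and one round of message passing mixes information across the graph. The adjacency pattern, dimensions, and layer design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalGraphLayer(nn.Module):
    """One step of message passing over a cross-modal graph (a minimal GNN-style layer)."""
    def __init__(self, dim=256):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, adj):
        # adj: (num_nodes, num_nodes) binary adjacency; row-normalize to average over neighbors.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        messages = (adj / deg) @ node_feats  # aggregate neighbor features
        return torch.relu(self.update(torch.cat([node_feats, messages], dim=-1)))

# Nodes: 8 video-frame embeddings followed by 8 speech-segment embeddings (random stand-ins).
frames, speech = torch.randn(8, 256), torch.randn(8, 256)
nodes = torch.cat([frames, speech], dim=0)

# Cross-modal edges: connect frame i with speech segment i (assumed temporal correspondence).
adj = torch.zeros(16, 16)
idx = torch.arange(8)
adj[idx, idx + 8] = 1.0
adj[idx + 8, idx] = 1.0

layer = CrossModalGraphLayer()
aligned_nodes = layer(nodes, adj)
print(aligned_nodes.shape)  # torch.Size([16, 256])
```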