Lesson 7.1: Fusion and Alignment
Multimodal learning aims to integrate and align data from different modalities (e.g., text, images, audio, video) to improve model performance. The two core challenges are:
- Fusion – How to combine information from multiple modalities effectively.
- Alignment – How to establish meaningful relationships between modalities.
Fusion Methods
Fusion strategies determine when and how modalities are combined. The three main approaches are:
A. Early Fusion (Feature-Level Fusion)
- Process: Raw data from all modalities are concatenated before feeding into the model.
- Example: Combining image pixels + text embeddings → input to a neural network.
- Pros:
- Captures low-level interactions between modalities.
- Computationally efficient (single model pipeline).
- Cons:
- Sensitive to noise/missing data.
- Loses modality-specific nuances.
- Use Cases: Simple tasks where modalities are tightly coupled (e.g., video + audio for lip-sync); a minimal sketch follows below.
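The following is a minimal PyTorch-style sketch of early fusion, assuming precomputed image features and text embeddings; the dimensions, the two-layer classifier, and the class count are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Feature-level fusion: modalities are concatenated before any joint processing."""
    def __init__(self, image_dim=2048, text_dim=768, num_classes=2):  # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, image_feats, text_feats):
        # Early fusion: a single concatenated vector enters one shared pipeline.
        fused = torch.cat([image_feats, text_feats], dim=-1)
        return self.net(fused)

# Toy usage: a batch of 4 samples with flattened image features and text embeddings.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Because everything after the concatenation is one shared pipeline, a noisy or missing modality corrupts the single fused vector, which is why early fusion is sensitive to missing data.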
B. Mid Fusion (Intermediate-Level Fusion)
- Process: Modalities are processed independently initially, then fused in intermediate layers.
- Example:
- Text → BERT → Intermediate features
- Image → CNN → Intermediate features
- Fused via attention or concatenation → Final prediction.
- Pros:
- Balances modality-specific and cross-modal learning.
- Flexible (e.g., cross-modal attention in transformers).
- Cons:
- Complex to design (requires careful layer selection).
- Use Cases: Vision-language tasks (e.g., VQA, image captioning); see the sketch below.
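A minimal sketch of mid fusion follows; the small linear encoders stand in for modality-specific backbones such as BERT or a CNN, and fusion here is plain concatenation of intermediate features (cross-modal attention is sketched separately under Alignment). All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MidFusionModel(nn.Module):
    """Each modality has its own encoder; fusion happens on intermediate features."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=256, num_classes=2):  # illustrative sizes
        super().__init__()
        # Small stand-ins for modality-specific backbones (e.g., BERT / a CNN).
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        # Joint layers operate only on the fused intermediate representation.
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, text_feats, image_feats):
        t = self.text_encoder(text_feats)     # modality-specific processing
        v = self.image_encoder(image_feats)
        fused = torch.cat([t, v], dim=-1)     # intermediate-level fusion
        return self.head(fused)

model = MidFusionModel()
out = model(torch.randn(4, 768), torch.randn(4, 2048))
print(out.shape)  # torch.Size([4, 2])
```

Choosing which intermediate layer to fuse at (and how) is exactly the design decision that makes mid fusion more complex than the other two strategies.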
C. Late Fusion (Decision-Level Fusion)
- Process: Each modality is processed independently, and final predictions are combined (e.g., averaging, voting).
- Example:
- Text model predicts sentiment → Score 1
- Audio model predicts sentiment → Score 2
- Combined via weighted average.
- Pros:
- Robust to missing modalities.
- Modular (easy to swap models).
- Cons:
- Ignores cross-modal interactions.
- Use Cases: Ensemble-based systems (e.g., emotion recognition from face + voice); illustrated below.
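A minimal sketch of decision-level fusion, assuming two unimodal sentiment models that each output class probabilities; the 0.6/0.4 weights are arbitrary and would normally be tuned on validation data.

```python
import torch

def late_fusion(text_probs, audio_probs, w_text=0.6, w_audio=0.4):
    """Decision-level fusion: combine per-modality class probabilities by weighted average."""
    return w_text * text_probs + w_audio * audio_probs  # weights are illustrative, normally tuned

# Toy usage: two unimodal sentiment models each output probabilities over 3 classes.
text_probs = torch.tensor([[0.7, 0.2, 0.1]])   # Score 1 (text model)
audio_probs = torch.tensor([[0.4, 0.4, 0.2]])  # Score 2 (audio model)
fused = late_fusion(text_probs, audio_probs)
print(fused)                 # fused class probabilities
print(fused.argmax(dim=-1))  # final decision
```

If one modality is unavailable, its weight can simply be set to zero and the remaining weights renormalized, which is why late fusion degrades gracefully when modalities are missing.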
Alignment Techniques
Alignment establishes semantic correspondences between modalities, so that related content in one modality maps to related content in another. Key methods include:
A. Attention Mechanisms
- Process: Dynamically weights features from different modalities based on relevance.
- Example: In vision-language models, cross-modal attention lets text tokens "attend" to image regions (and vice versa).
- Pros:
- Interpretable (shows which modalities contribute to predictions).
- Handles variable-length inputs.
- Use Cases: Image captioning, visual question answering, multimodal transformers; a cross-attention snippet follows.
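A small cross-modal attention sketch using PyTorch's nn.MultiheadAttention, with text tokens as queries attending over image regions as keys and values; the embedding size, head count, and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Cross-modal attention: text tokens (queries) attend over image regions (keys/values).
embed_dim = 256  # illustrative size
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

text_tokens   = torch.randn(2, 12, embed_dim)  # (batch, num_text_tokens, dim)
image_regions = torch.randn(2, 49, embed_dim)  # (batch, num_image_regions, dim), e.g., a 7x7 grid

attended, weights = attn(query=text_tokens, key=image_regions, value=image_regions)
print(attended.shape)  # (2, 12, 256): text features enriched with visual context
print(weights.shape)   # (2, 12, 49): how strongly each text token attends to each region
```

The returned attention weights are what make this approach interpretable: they show how strongly each text token attends to each image region.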
B. Maximum Mean Discrepancy (MMD)
- Process: A kernel-based statistical distance between feature distributions; minimizing it as a loss pulls the modalities' latent distributions together.
- Example: Aligns latent spaces of text and image embeddings by minimizing MMD loss.
- Pros: Effective for unsupervised alignment.
- Use Cases: Domain adaptation (e.g., aligning medical images + reports); a sketch of the loss follows.
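A minimal sketch of an RBF-kernel MMD loss between two embedding sets; the bandwidth sigma and the simple biased estimator used here are assumptions, and practical implementations often combine several kernel bandwidths.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased MMD estimate between two embedding sets, using a single RBF kernel (sigma is illustrative)."""
    def kernel(a, b):
        sq_dists = torch.cdist(a, b).pow(2)            # pairwise squared Euclidean distances
        return torch.exp(-sq_dists / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Toy usage: penalize distribution mismatch between text and image latent spaces.
text_z  = torch.randn(64, 128)
image_z = torch.randn(64, 128) + 0.5    # deliberately shifted distribution
print(rbf_mmd(text_z, image_z).item())  # larger value => distributions are further apart
```

Adding this term to the training loss pushes the two latent distributions toward each other without requiring paired labels, which is why MMD suits unsupervised alignment.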
C. Contrastive Learning
- Process: Pulls paired multimodal data closer in embedding space while pushing unpaired data apart.
- Example: CLIP (Contrastive Language-Image Pretraining) aligns images + text via contrastive loss.
- Pros: Works well with limited labeled data.
- Use Cases: Multimodal retrieval (e.g., searching images with text queries); see the example below.
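A CLIP-style symmetric contrastive loss sketch over a batch of paired image and text embeddings; the fixed temperature and embedding size are illustrative assumptions (in CLIP the temperature is a learned parameter).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image-text pairs are positives, all other pairs negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine-similarity matrix
    targets = torch.arange(logits.size(0))            # the i-th image is paired with the i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random paired embeddings (batch of 8, dimension 512).
print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```

Because every other sample in the batch serves as a negative, this objective needs only paired data (no class labels), which is what makes it effective with limited supervision.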
D. Graph-Based Alignment
- Process: Represents units from each modality (e.g., video frames, speech segments, words) as nodes in a graph and aligns them via graph neural networks (GNNs).
- Example: Aligning video frames (visual) + speech (audio) via relational graphs.
- Pros: Captures complex, non-linear relationships.
- Use Cases: Video understanding, multimodal knowledge graphs; a toy example follows.
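As a rough illustration, the sketch below builds one GCN-style message-passing layer in plain PyTorch over a toy graph whose nodes are video-frame and speech-segment features, with edges connecting co-occurring pairs; the node counts, feature size, and single-layer design are illustrative assumptions (a real system would typically use a GNN library and richer edge structure).

```python
import torch
import torch.nn as nn

class CrossModalGraphLayer(nn.Module):
    """One GCN-style message-passing step over a graph whose nodes mix two modalities."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # Add self-loops, symmetrically normalize the adjacency, then propagate messages.
        adj = adj + torch.eye(adj.size(0))
        deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
        adj_norm = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(adj_norm @ node_feats))

# Toy graph: 5 video-frame nodes + 5 speech-segment nodes, edges link co-occurring pairs.
frame_feats  = torch.randn(5, 64)
speech_feats = torch.randn(5, 64)
nodes = torch.cat([frame_feats, speech_feats], dim=0)   # (10, 64)

adj = torch.zeros(10, 10)
for i in range(5):
    adj[i, 5 + i] = adj[5 + i, i] = 1.0                 # frame i <-> speech segment i

layer = CrossModalGraphLayer(64)
print(layer(nodes, adj).shape)  # torch.Size([10, 64]); each node now carries cross-modal context
```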