Lesson 7.3: Contrastive Language–Image Pretraining (CLIP)


CLIP (Contrastive Language–Image Pretraining) is a groundbreaking multimodal model developed by OpenAI that learns visual concepts from natural language supervision. It bridges computer vision and natural language processing (NLP) by jointly training an image encoder and a text encoder to predict which image-text pairs in a dataset belong together.
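As a concrete illustration, the sketch below queries a pretrained CLIP checkpoint for zero-shot image-text matching. It assumes the Hugging Face transformers library, the publicly released openai/clip-vit-base-patch32 weights, and a local image file (example.jpg is a placeholder path).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (downloads weights on first use).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
texts = ["a photo of a dog", "a photo of a cat", "a diagram of a network"]

# The processor tokenizes the texts and preprocesses the image.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

The printed probabilities indicate which caption the model considers the best match for the image, without any task-specific fine-tuning.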

CLIP Architecture

  • CLIP consists of two components: an image encoder and a text encoder.
  • The image encoder (a ResNet or vision transformer in the original work) maps images into visual representations.
  • The text encoder is a transformer that maps text into semantic representations.
  • During training, the image and text representations are compared with a contrastive loss, as sketched in the code after this list.
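A minimal sketch of this dual-encoder layout is shown below. The backbone modules, feature dimensions, and projection layers are illustrative placeholders, not the exact configuration of OpenAI's released models.

```python
import torch
import torch.nn as nn

class CLIPStyleModel(nn.Module):
    """Minimal dual-encoder sketch: both backbones are placeholder modules."""

    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a ViT returning (batch, image_dim) features
        self.text_backbone = text_backbone    # e.g. a transformer returning (batch, text_dim) features
        # Linear projections map both modalities into the same embedding dimension.
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor):
        img_emb = self.image_proj(self.image_backbone(images))
        txt_emb = self.text_proj(self.text_backbone(token_ids))
        # L2-normalize so the dot product between embeddings equals cosine similarity.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        return img_emb, txt_emb
```

Normalizing both outputs means the contrastive loss can compare modalities with a simple dot product, which is how the matched and mismatched pairs are scored during training.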

How CLIP Works

  • Dual Encoders:
    • CLIP consists of two separate encoders – one for images and one for text.
    • The image encoder typically uses a vision transformer (ViT) architecture, while the text encoder uses a transformer-based language model.
  • Joint Embedding Space:
    • The outputs of both encoders are projected into a joint embedding space where the similarity between image and text embeddings can be measured.
  • Training Objective:
    • During training, CLIP learns to maximize the similarity between correct image-text pairs while minimizing the similarity between mismatched pairs; see the loss sketch after this list.
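The sketch below shows one common way to implement this symmetric contrastive objective (an InfoNCE-style loss over cosine similarities). The function name and the fixed temperature value are assumptions for illustration; CLIP itself treats the temperature as a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Both inputs are (batch, dim) tensors, already L2-normalized, where row i
    of each tensor corresponds to the same image-text pair.
    """
    # (batch, batch) matrix of scaled cosine similarities.
    logits = image_embeds @ text_embeds.t() / temperature
    # Matching pairs lie on the diagonal, so target index i for row i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

Because the matched pair for each row sits on the diagonal of the similarity matrix, cross-entropy in both directions pulls correct pairs together and pushes mismatched pairs apart.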