Lesson 7.3: Contrastive Language–Image Pretraining (CLIP)
CLIP (Contrastive Language–Image Pretraining) is a groundbreaking multimodal model developed by OpenAI that learns visual concepts from natural language supervision. It bridges computer vision and natural language processing (NLP) by jointly training an image encoder and a text encoder to predict which image-text pairs in a dataset actually belong together.
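As a concrete illustration, here is a minimal sketch of querying a pretrained CLIP checkpoint for image-text matching with the Hugging Face `transformers` library. The checkpoint name `openai/clip-vit-base-patch32`, the example image URL, and the candidate captions are illustrative choices, not part of this lesson's required setup.

```python
# Sketch: zero-shot image-text matching with a pretrained CLIP checkpoint.
# Assumes `transformers`, `torch`, `Pillow`, and `requests` are installed;
# the checkpoint name and image URL are illustrative placeholders.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this URL is just a placeholder example.
image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
candidate_captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(candidate_captions, probs[0].tolist())))
```
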
CLIP Architecture
- CLIP consists of two components: an image encoder and a text encoder.
  - The image encoder is a vision transformer that maps images into visual representations.
  - The text encoder is a transformer that maps text into semantic representations.
  - The image and text representations are then compared using a contrastive loss function during training (see the dual-encoder sketch below).
 
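To make the two-tower structure concrete, below is a minimal PyTorch sketch of the dual-encoder idea: two independent encoders whose outputs are projected into a shared embedding space and L2-normalized so that cosine similarity reduces to a dot product. The encoder bodies here are small placeholder MLPs, not the actual ViT and transformer used by CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDualEncoder(nn.Module):
    """Minimal stand-in for CLIP's two-tower design (not the real ViT/transformer)."""

    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Placeholder encoders: in CLIP these are a vision transformer (or ResNet)
        # and a transformer language model.
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, 1024), nn.ReLU(), nn.Linear(1024, 1024))
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, 1024), nn.ReLU(), nn.Linear(1024, 1024))
        # Linear projections into the shared (joint) embedding space.
        self.image_proj = nn.Linear(1024, embed_dim)
        self.text_proj = nn.Linear(1024, embed_dim)

    def forward(self, image_features, text_features):
        img = F.normalize(self.image_proj(self.image_encoder(image_features)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(text_features)), dim=-1)
        return img, txt  # unit-norm embeddings; cosine similarity = dot product

model = ToyDualEncoder()
img_emb, txt_emb = model(torch.randn(4, 2048), torch.randn(4, 768))
similarity = img_emb @ txt_emb.t()  # 4x4 matrix of image-text similarities
print(similarity.shape)
```
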
How CLIP Works
- Dual Encoders:
  - CLIP consists of two separate encoders, one for images and one for text.
  - The image encoder typically uses a vision transformer (ViT) architecture, while the text encoder uses a transformer-based language model.

- Joint Embedding Space:
  - The outputs of both encoders are projected into a joint embedding space where the similarity between image and text embeddings can be measured.

- Training Objective:
  - During training, CLIP learns to maximize the similarity between matched image-text pairs while minimizing the similarity between mismatched pairs (see the contrastive loss sketch after this list).
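This training objective is commonly implemented as a symmetric cross-entropy over the batch's matrix of image-text similarities. The sketch below shows one such formulation, assuming unit-norm embeddings like those produced by the dual-encoder sketch above; the fixed scalar temperature is a simplification of CLIP's learnable temperature parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs.

    Assumes image_embeds[i] and text_embeds[i] form an aligned pair and that both
    are already L2-normalized, so the dot product is a cosine similarity.
    """
    # Similarity matrix: entry (i, j) compares image i against text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The correct match for each image/text is the one at the same batch index.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random unit-norm embeddings (batch of 8, embedding dim 512).
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```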