Lesson 7.3: Contrastive Language–Image Pretraining (CLIP)
CLIP (Contrastive Language–Image Pretraining) is a groundbreaking multimodal model developed by OpenAI that learns visual concepts from natural language supervision. It bridges computer vision and natural language processing (NLP) by jointly training an image encoder and a text encoder to predict which image-text pairs in a dataset actually belong together.
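As a concrete illustration, here is a minimal sketch of querying a pretrained CLIP checkpoint for image-text matching with the Hugging Face `transformers` library. The checkpoint name `openai/clip-vit-base-patch32`, the example image URL, and the candidate captions are illustrative choices, not part of this lesson's required setup.

```python
# Sketch: zero-shot image-text matching with a pretrained CLIP checkpoint.
# Assumes `transformers`, `torch`, `Pillow`, and `requests` are installed;
# the checkpoint name and image URL are illustrative placeholders.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this URL is just a placeholder example.
image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
candidate_captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(candidate_captions, probs[0].tolist())))
```
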
CLIP Architecture
- CLIP consists of two components: an image encoder and a text encoder.
  - The image encoder is a vision transformer that maps images into visual representations.
  - The text encoder is a transformer that maps text into semantic representations.
  - The image and text representations are then compared using a contrastive loss function during training (see the dual-encoder sketch below).
 
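To make the two-tower structure concrete, below is a minimal PyTorch sketch of the dual-encoder idea: two independent encoders whose outputs are projected into a shared embedding space and L2-normalized so that cosine similarity reduces to a dot product. The encoder bodies here are small placeholder MLPs, not the actual ViT and transformer used by CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDualEncoder(nn.Module):
    """Minimal stand-in for CLIP's two-tower design (not the real ViT/transformer)."""

    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Placeholder encoders: in CLIP these are a vision transformer (or ResNet)
        # and a transformer language model.
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, 1024), nn.ReLU(), nn.Linear(1024, 1024))
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, 1024), nn.ReLU(), nn.Linear(1024, 1024))
        # Linear projections into the shared (joint) embedding space.
        self.image_proj = nn.Linear(1024, embed_dim)
        self.text_proj = nn.Linear(1024, embed_dim)

    def forward(self, image_features, text_features):
        img = F.normalize(self.image_proj(self.image_encoder(image_features)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(text_features)), dim=-1)
        return img, txt  # unit-norm embeddings; cosine similarity = dot product

model = ToyDualEncoder()
img_emb, txt_emb = model(torch.randn(4, 2048), torch.randn(4, 768))
similarity = img_emb @ txt_emb.t()  # 4x4 matrix of image-text similarities
print(similarity.shape)
```
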
How CLIP Works
- Dual Encoders:
  - CLIP consists of two separate encoders, one for images and one for text.
  - The image encoder typically uses a vision transformer (ViT) architecture, while the text encoder uses a transformer-based language model.

- Joint Embedding Space:
  - The outputs of both encoders are projected into a joint embedding space where the similarity between image and text embeddings can be measured.

- Training Objective:
  - During training, CLIP learns to maximize the similarity between matched image-text pairs while minimizing the similarity between mismatched pairs (see the contrastive loss sketch after this list).
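This training objective is commonly implemented as a symmetric cross-entropy over the batch's matrix of image-text similarities. The sketch below shows one such formulation, assuming unit-norm embeddings like those produced by the dual-encoder sketch above; the fixed scalar temperature is a simplification of CLIP's learnable temperature parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs.

    Assumes image_embeds[i] and text_embeds[i] form an aligned pair and that both
    are already L2-normalized, so the dot product is a cosine similarity.
    """
    # Similarity matrix: entry (i, j) compares image i against text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The correct match for each image/text is the one at the same batch index.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random unit-norm embeddings (batch of 8, embedding dim 512).
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```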