CLIP (Contrastive Language–Image Pre-training)
Machine learning model developed by OpenAI that learns visual concepts from natural language supervision, enabling it to relate images to textual descriptions.
CLIP, which stands for Contrastive Language–Image Pre-training, is designed to bridge the gap between visual data and textual information by training on a large, diverse collection of images paired with textual descriptions. The model uses a contrastive learning approach in which it maximizes the similarity between matching image–text pairs while minimizing it for mismatched pairs within a batch. Unlike traditional computer vision models that require specific labels for each image, CLIP learns from a broad set of data collected from the internet, allowing it to generalize across a wide variety of visual tasks without task-specific training data.
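The paper describes this objective as a symmetric cross-entropy loss over cosine similarities between image and text embeddings. The following is a minimal sketch of that idea in PyTorch, assuming embeddings have already been produced by separate image and text encoders; the batch size, embedding dimension, and temperature value are illustrative placeholders rather than CLIP's exact training configuration.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    # Normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i + loss_t) / 2

# Example with random placeholder embeddings for a batch of 8 pairs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb))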
CLIP was introduced by OpenAI in 2021. The model quickly gained attention for its robust performance across diverse visual tasks and its ability to perform zero-shot learning, classifying images into categories it has never been explicitly trained on, based only on textual descriptions of those categories.
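In practice, zero-shot classification works by encoding natural-language prompts for each candidate category and comparing them to the image embedding. The sketch below uses the reference openai/CLIP package (github.com/openai/CLIP); the model variant, prompt template, label set, and image path are illustrative choices, not prescribed by this entry.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["cat", "dog", "bicycle"]                       # candidate categories
prompts = [f"a photo of a {label}" for label in labels]  # natural-language class descriptions

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt, softmaxed into scores.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")

Because the categories are expressed as text at inference time, the same model can be pointed at a new label set simply by changing the prompts, with no additional training.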
The development of CLIP at OpenAI was led by Alec Radford and colleagues, building on the organization's research into scaling up neural networks for natural language processing and computer vision. The model is part of a broader effort in AI to create more flexible, generally capable systems that can understand and interpret the world in ways closer to how humans do.