A study into Contrastive Language–Image Pretraining

Note: This study is still in progress. Consider it a rough draft.

CLIP (Contrastive Language–Image Pretraining) is a joint embedding model that learns a shared semantic space between images and text.

The model is trained on image-text pairs. The image in each pair is passed through the image encoder and the text through the text encoder, and both encoders produce a vector embedding in the same vector space.

An example of the image embedding and text embedding within the same vector space

It's worth noting that the vector embeddings are high-dimensional. For visualization they may be drawn in 2 or 3 dimensions, but in reality they often have shape [1, 512], i.e. a 512-dimensional vector, which cannot be visualized directly.

The two embeddings in a pair are compared using cosine similarity, which measures how closely their directions align. Because the two embeddings come from the same pair, they are treated as the ground-truth match, and the training objective rewards pulling them as close together as possible relative to all other pairs in the current batch.
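As a rough illustration of this comparison, the sketch below computes the cosine similarity between every image embedding and every text embedding in a small batch. The embeddings are random placeholders rather than real encoder outputs, and the function name cosine_similarity_matrix is illustrative, not part of CLIP.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Return an [N, N] matrix where entry (i, j) is cos(image_i, text_j)."""
    # L2-normalise each row so that a plain dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 512))  # placeholder for image encoder output
text_emb = rng.normal(size=(4, 512))   # placeholder for text encoder output

sim = cosine_similarity_matrix(image_emb, text_emb)
print(sim.shape)
```

The diagonal entries sim[i, i] correspond to the matching pairs, while every off-diagonal entry compares embeddings that come from different pairs.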

The cosine similarities (scaled by a learned temperature) are then fed into a cross-entropy loss, which provides the gradients used to adjust both encoders' weights and biases during backpropagation.

In addition to comparing the two embeddings within a pair, every embedding in the batch is compared with the embeddings from all other pairs in that batch using cosine similarity. Because these embeddings do not belong to the same pair, they are treated as contrastive examples: the loss penalizes similarity between them and encourages the encoders to produce embeddings that point in different directions.

For each batch, every pair is encouraged to have embeddings that point in a similar direction to each other, while pointing in dissimilar directions to all other pairs.

Each embedding is compared against embeddings of the opposite modality, in both directions:

Text Embeddings -> Image Embeddings
Image Embeddings -> Text Embeddings
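
The two directions above can be sketched as a symmetric cross-entropy loss over the batch similarity matrix. This is a minimal NumPy sketch under stated assumptions, not CLIP's actual training code: the function names are illustrative, the embeddings are random placeholders, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np

def log_softmax(logits, axis):
    # Subtract the max along the axis for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (image_emb @ text_emb.T) / temperature  # [N, N] scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Image -> text: each image (row) should pick its own text (column).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: each text (column) should pick its own image (row).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
# Nearly identical embeddings within each pair (well aligned).
aligned = clip_style_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
# Unrelated embeddings within each pair (not aligned).
random_pairs = clip_style_loss(emb, rng.normal(size=emb.shape))
print(aligned, random_pairs)
```

The loss is low when each pair's embeddings point the same way and differ from the rest of the batch, and high when the pairs are unrelated.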

CLIP is conventionally trained with very large batch sizes of around 30,000+ pairs. This gives a wide variety of similar and dissimilar directions to optimize against in each step. Batches are formed by a standard shuffle per epoch.

The training objective is to minimize the angle between the text and image embeddings within each pair and to increase the angle between non-matching (contrastive) pairs as much as possible. This is why it is called a contrastive model.

Meaning-Space

CLIP trains two encoders to map different modalities into a shared semantic vector space

Image Encoder: learns which visual features correspond to linguistic concepts
Text Encoder: learns which linguistic features correspond to visual concepts

This is achieved by training the encoders so that embeddings from matching image–text pairs are directionally aligned, which over many batches causes semantically related images and texts to occupy nearby regions (clusters) in the shared embedding space.

Here we can see the semantic groupings in the vector space. A UMAP projection has been used to reduce the dimensionality to 3 dimensions. Dataset: MNIST.

These clusters of semantically similar embeddings are an emergent behaviour: as the encoders gradually learn to encode their inputs better, embeddings with similar semantics end up grouped together over many iterations.

Training Outcome

Because both encoders produce embeddings that cluster by semantic concept, positions in the space now carry meaning.

The image encoder can be used on its own to create meaningful embeddings based on input images. Each image embedding will now be semantically aligned with the linguistic meaning of that image and in the “neighbourhood” of similar semantic concepts.
The text encoder can be used on its own to create meaningful embeddings. Each text embedding will now be semantically aligned with the visual meaning of that input text and in the “neighbourhood” of similar semantic concepts.

The high-dimensional, semantically aligned vector space forms a continuous manifold with no hard divisions or discrete components. This allows for zero-shot learning, interpolation and diffusion conditioning.

Continuous semantic manifold

Continuous space (not discrete labels)
Smooth (small moves = small semantic change)
Shared across modalities

Zero-Shot generalization

CLIP can respond predictably to unseen inputs as it only needs the semantic direction implied by the text

“a photo of a red ceramic teapot in the style of van gogh”

CLIP has likely never seen this exact phrase, but each component lies in a familiar region of the manifold.

The resulting embedding is still a valid point that is interpretable and meaningful.
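
One common way to use this property is zero-shot classification: embed a handful of text prompts, embed the image, and pick the prompt whose direction is closest. The sketch below shows the idea with placeholder vectors instead of real CLIP outputs. The function name zero_shot_classify, the prompts and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Return a probability per prompt from scaled cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = (prompt_embs @ image_emb) / temperature
    # Softmax with max subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a teapot"]

rng = np.random.default_rng(0)
# Placeholder text embeddings; a real system would use the text encoder.
prompt_embs = rng.normal(size=(3, 512))
# Placeholder image embedding built to lie close to the "teapot" prompt.
image_emb = prompt_embs[2] + 0.1 * rng.normal(size=512)

probs = zero_shot_classify(image_emb, prompt_embs)
print(prompts[int(np.argmax(probs))])
```

No classifier is trained for these labels; the prediction comes only from directions in the shared space.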

Semantic interpolation

0.5 * (“cat”) + 0.5 * (“dog”) → “cat–dog hybrid”

Directions interpolate meaningfully
No sharp class boundaries exist
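
A minimal sketch of the interpolation above, assuming the "cat" and "dog" embeddings are already available as vectors. Random placeholders are used here, so only the geometry is demonstrated, not the semantics. The helper names normalize and interpolate are illustrative.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def interpolate(a, b, t=0.5):
    """Linear blend of two embeddings, re-normalised to unit length."""
    return normalize((1.0 - t) * normalize(a) + t * normalize(b))

rng = np.random.default_rng(0)
cat = rng.normal(size=512)  # placeholder for the "cat" text embedding
dog = rng.normal(size=512)  # placeholder for the "dog" text embedding

mid = interpolate(cat, dog, 0.5)
# At t = 0.5 the blend is equally similar to both endpoints.
print(mid @ normalize(cat), mid @ normalize(dog))
```

Re-normalising after blending keeps the result on the unit sphere, so it can be compared with other embeddings by cosine similarity in the usual way.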

Compositionality

CLIP embeddings compose roughly linearly

“photo” + “oil painting”
“realistic” − “photo”
“style A” + “subject B”

Meaning is encoded geometrically
Not symbolically
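
The arithmetic above can be illustrated with a toy example. Note the strong assumption: the embeddings below are built to be exactly linear (style vector plus subject vector), which real CLIP embeddings only approximate. All names and vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical concept directions; real CLIP does not expose these directly.
style = {"photo": rng.normal(size=dim), "oil painting": rng.normal(size=dim)}
subject = {"cat": rng.normal(size=dim), "teapot": rng.normal(size=dim)}

# Toy embeddings that are linear by construction: embedding = style + subject.
emb = {f"{s} of a {o}": style[s] + subject[o] for s in style for o in subject}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "photo of a cat" - "photo" + "oil painting"
query = emb["photo of a cat"] - style["photo"] + style["oil painting"]

# Find the toy embedding whose direction is closest to the query.
best = max(emb, key=lambda name: cosine(emb[name], query))
print(best)
```

In this toy setting the query lands on the "oil painting of a cat" embedding. With real embeddings the result would only be approximately in that direction.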

Tokens Vs Pooled

The text encoder's output is used differently in training than in some inference settings.

During training, the whole sentence or phrase is turned into a single vector embedding representing a single semantic direction in the vector space.

However, when the text encoder is used for inference in some downstream settings (for example diffusion conditioning), the per-token outputs are kept rather than collapsed into one vector.

For example:

Input text

A red dog with a top hat

Training

In training this whole sentence or group of words will be used as the text that pairs to a single image.

1) Text -> tokenized and padded or truncated to 77 tokens
2) Tokens -> processed by a Transformer
3) The sequence is collapsed to ONE vector of dimension D (e.g. 512)
4) Only that one vector is used in the contrastive loss within training

The sentence is broken down into something like:

[token₁, token₂, token₃, … token₇₇]

It includes a special EOS (end-of-sentence) token.

The transformer (text encoder) produces contextual embeddings for every token.

output: [77, D]

At this point “red”, “dog” and “hat” are separate vectors; meaning is distributed across tokens.

Each token embedding can be thought of as a direction in semantic space.

These 77 token embeddings are then collapsed into a single embedding that summarizes all 77 semantic directions.

The pooled sentence embedding is not a simple sum of those directions, but a learned, weighted aggregation (via attention) that combines them into a single semantic direction representing the whole sentence.

This occurs via the special EOS token, which attends to all other tokens; its final embedding is therefore a learned function of all the token embeddings.
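
The pooling step can be sketched in terms of shapes. This is a simplified illustration, not the encoder itself: the transformer output and the projection matrix are random placeholders, the word token ids are arbitrary, and the function name pool_at_eos is made up for this example. It assumes the convention of the original CLIP tokenizer, where SOS is 49406, EOS is 49407 (the highest id in the vocabulary) and padding is 0.

```python
import numpy as np

def pool_at_eos(hidden_states, token_ids, projection):
    """Pick the hidden state at the EOS position and project it to the shared space."""
    # EOS has the highest id in the vocabulary, so argmax over the ids finds it.
    eos_pos = int(np.argmax(token_ids))
    return hidden_states[eos_pos] @ projection

context_len, width, D = 77, 512, 512
rng = np.random.default_rng(0)

# "A red dog with a top hat": SOS + 7 word tokens + EOS, then zero padding to 77.
token_ids = np.zeros(context_len, dtype=np.int64)
token_ids[0] = 49406                              # SOS
token_ids[1:8] = rng.integers(1, 40000, size=7)   # arbitrary placeholder word ids
token_ids[8] = 49407                              # EOS

hidden_states = rng.normal(size=(context_len, width))  # placeholder transformer output
projection = rng.normal(size=(width, D))               # placeholder learned projection

pooled = pool_at_eos(hidden_states, token_ids, projection)
print(hidden_states.shape, pooled.shape)
```

The [77, width] array holds the per-token outputs, while the pooled vector of size D is the one that would enter the contrastive loss.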

Inference

Although CLIP training uses only the pooled sentence embedding in the contrastive loss, the text encoder must internally preserve token-level semantic structure in order to produce it. As a result, individual token embeddings (e.g. “red”, “dog”, “hat”) retain meaningful, separable directions that can be used independently at inference time, even though training supervision was applied only at the pooled level.