Visual Prompting

Large language models like GPT-3 can be prompted with in-context examples or instructions to complete tasks without fine-tuning the model’s parameters. Prompting handles open-ended queries without introducing large numbers of learnable parameters. However, manually crafting a prompt that maximizes the likelihood of the desired output is challenging; such hand-written prompts are known as hard prompts. Specific downstream tasks may also require domain adaptation. This motivates soft prompts: tunable vectors appended to the input that steer the model toward the desired output. Soft prompts help in low-data domains and improve generalization without exhaustive prompt engineering.

Language model architectures have now been successfully applied to the visual modality, such as in Vision Transformers (ViT) and CLIP models. An intriguing question is whether the idea of soft prompting for language models can also be applied in the visual domain for adaptation. The answer is yes – there are several promising ways to implement soft visual prompts. In this post, we will review different techniques for adapting vision models using soft prompting.

Visual Prompt Learning

While the methods discussed below can also be applied to vision-only models, results indicate that prompting vision-language models yields better domain adaptation on downstream tasks.

Zero-shot prompting

CLIP (Contrastive Language-Image Pretraining) [1] combines corresponding visual and language information by aligning an encoder for each modality with a contrastive loss. The loss maximizes the similarity between matched image-text pairs while minimizing it for mismatched pairs, so the two encoders produce nearby embeddings for visual and textual inputs that describe the same underlying concept.
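
To make this concrete, here is a minimal sketch of the symmetric contrastive (InfoNCE-style) objective used to align the two encoders; the function name and temperature value are illustrative rather than CLIP’s exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, tau=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs (sketch).

    image_feats, text_feats: (B, dim) features from the two encoders; row i of each is a matched pair.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / tau                      # (B, B) scaled cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)    # matched pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                      # image -> its matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)                  # text -> its matching image
    return (loss_i2t + loss_t2i) / 2
```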

To leverage CLIP’s zero-shot capabilities for image classification, we can use a template like “This is a photo of [CLASS]” on the text side. The class name prompts are passed to the language encoder, while images are fed to the vision encoder. CLIP then predicts the class whose textual prompt embedding has the highest similarity to the image embedding. The original CLIP authors engineered about 80 prompts [2] for each class to maximize performance, demonstrating the impact of prompt engineering for zero-shot CLIP.
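
The snippet below sketches this zero-shot setup with the Hugging Face transformers interface to CLIP; the model name, class list, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "car"]                               # example label set
prompts = [f"This is a photo of a {c}" for c in classes]      # hand-written (hard) prompt template
image = Image.open("example.jpg")                             # placeholder image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)              # similarity of the image to each prompt
print(classes[probs.argmax().item()])
```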

Prompting the language side of vision-language models

We can use the notion of a soft prompt for vision-language models too. For models like CLIP, using the right template for the prompt affects the results considerably. Therefore, if we append tunable parameters to the language input of CLIP, we can get better domain adaptation.

CoOp [3] uses exactly this idea: a tunable prompt $t_i$, consisting of learnable context vectors followed by the class-name token of class $i$, is fed to the language encoder $g$. Both the image encoder $f$ and the language encoder are kept frozen; only the prompt vectors are learned, by minimizing the cross-entropy of the class probabilities:

\[p(y = i \mid x) = \frac{\exp\left(\cos(g(t_i), f(x)) / \tau\right)}{\sum_{j=1}^{K} \exp\left(\cos(g(t_j), f(x)) / \tau\right)}\]
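
A minimal PyTorch sketch of this setup is shown below. It assumes a frozen `text_encoder` that maps a sequence of token embeddings to a single text feature and pre-computed features from the frozen image encoder; in the real CoOp implementation the context vectors are injected at CLIP’s token-embedding layer, and all names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp-style soft prompt: learnable context vectors shared across classes."""
    def __init__(self, n_ctx, dim, class_token_embs):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)    # the tunable prompt vectors
        # Frozen embeddings of the class-name tokens, shape (n_cls, n_name_tokens, dim).
        self.register_buffer("cls_embs", class_token_embs)

    def forward(self):
        n_cls = self.cls_embs.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)          # (n_cls, n_ctx, dim)
        return torch.cat([ctx, self.cls_embs], dim=1)              # t_i = [context | class tokens]

def coop_loss(prompt_learner, text_encoder, image_feats, labels, tau=0.01):
    prompts = prompt_learner()                                     # (n_cls, n_ctx + n_name, dim)
    text_feats = F.normalize(text_encoder(prompts), dim=-1)        # g(t_i); encoder assumed frozen
    image_feats = F.normalize(image_feats, dim=-1)                 # f(x); pre-computed and frozen
    logits = image_feats @ text_feats.t() / tau                    # cosine similarity / temperature
    return F.cross_entropy(logits, labels)                         # gradients flow only into the prompt
```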

To make the prompt generalize better to unseen classes, CoCoOp [4] conditions the tunable context vectors $v_m$ on the image features $f(x)$ by learning a lightweight meta-network $h_{\theta}$. Its output is added to each context vector, making the prompt instance-conditional:

\[t_i(x) = \{v_1(x), v_2(x), \ldots, v_M(x), c_i\}, \qquad v_m(x) = v_m + h_{\theta}(x)\]
where $c_i$ is the token embedding of the $i$-th class name.
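
In code, and under the same assumptions and illustrative names as the CoOp sketch above, the meta-network simply produces an image-conditioned shift that is added to the shared context vectors:

```python
import torch
import torch.nn as nn

class ConditionalPromptLearner(nn.Module):
    """CoCoOp-style sketch: context vectors shifted by an image-conditioned bias h_theta(x)."""
    def __init__(self, n_ctx, dim, class_token_embs):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.register_buffer("cls_embs", class_token_embs)          # (n_cls, n_name_tokens, dim)
        # Lightweight meta-network h_theta: image feature -> per-image context shift.
        self.meta_net = nn.Sequential(nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim))

    def forward(self, image_feat):                                  # image_feat: (dim,), i.e. f(x)
        shift = self.meta_net(image_feat)                           # h_theta(x)
        ctx = self.ctx + shift                                      # v_m(x) = v_m + h_theta(x)
        n_cls = self.cls_embs.shape[0]
        ctx = ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.cls_embs], dim=1)               # one instance-conditioned prompt per class
```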

Prompting the visual side of vision-language models

Elsayed et al. [5] first introduced adversarial reprogramming, where a target model is reprogrammed to perform a task chosen by the attacker. This was done by learning a window of pixels around the input image, whose weights were updated so that the model’s outputs map to the target classes of the attacker’s choice.

The same method can be used to adapt a pretrained image classifier or a vision-language model to a new domain with different classes. The learnable window is essentially a soft prompt that can be added to the image input of a model like CLIP [6]. Inference on new tasks can then be performed by simply swapping in the visual prompt learned for the desired task.

Therefore, given a pretrained model $F$ and a downstream dataset $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, the objective is to learn a visual prompt $v_{\theta}$ such that:

\[\max_{\theta} P(y \mid x + v_{\theta})\]
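
A simple way to realize $v_{\theta}$ is a learnable perturbation restricted to a border around the image, trained with cross-entropy while the model $F$ stays frozen. Below is a minimal PyTorch sketch; the image size, padding width, and the `frozen_logits_fn` hook are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPrompt(nn.Module):
    """Pixel-space visual prompt: a learnable perturbation confined to a border of width `pad`."""
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        mask = torch.ones(1, 1, image_size, image_size)
        mask[:, :, pad:-pad, pad:-pad] = 0.0                 # keep only the border region trainable
        self.register_buffer("mask", mask)

    def forward(self, x):                                    # x: (B, 3, H, W)
        return x + self.delta * self.mask                    # x + v_theta

def train_step(prompt, frozen_logits_fn, images, labels, optimizer):
    """One prompt update; `frozen_logits_fn` wraps the frozen model F (e.g. CLIP with fixed text features)."""
    logits = frozen_logits_fn(prompt(images))                # F(x + v_theta)
    loss = F.cross_entropy(logits, labels)                   # maximize P(y | x + v_theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```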

You can also increase the granularity of the soft prompt and insert learnable prompt tokens in the embedding space of the image encoder, as in Visual Prompt Tuning (VPT) [7]. The prompts can be inserted only at the first transformer layer (VPT-Shallow) or at every layer (VPT-Deep).
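
A rough sketch of the shallow variant is given below; `patch_embed`, `cls_token`, `vit_blocks`, and `head` are assumed hooks into an existing ViT (positional embeddings and other details are omitted), not a real library API.

```python
import torch
import torch.nn as nn

class ShallowVPT(nn.Module):
    """VPT-Shallow-style sketch: learnable prompt tokens prepended to a frozen ViT's token sequence."""
    def __init__(self, patch_embed, cls_token, vit_blocks, head, n_prompts=10, dim=768):
        super().__init__()
        self.patch_embed, self.cls_token, self.blocks, self.head = patch_embed, cls_token, vit_blocks, head
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)
        for p in list(patch_embed.parameters()) + list(vit_blocks.parameters()):
            p.requires_grad_(False)                           # backbone stays frozen; prompts (and head) train

    def forward(self, x):
        tokens = self.patch_embed(x)                          # (B, N, dim) patch embeddings
        B = tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        prompts = self.prompts.expand(B, -1, -1)
        seq = torch.cat([cls, prompts, tokens], dim=1)        # [CLS | prompt tokens | patches]
        seq = self.blocks(seq)                                # frozen transformer blocks
        return self.head(seq[:, 0])                           # classify from the CLS token
```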

Prompting both sides of vision-language models

Unimodal prompt tuning methods do not perform consistently well, which may be due to misalignment between the prompts when the two modalities are learned independently. Unified Prompt Tuning (UPT) [8] introduces both text and visual prompts and jointly optimizes the two modalities simultaneously.

UPT starts from an initial shared prompt and uses a lightweight self-attention network to generate the prompts for CLIP’s text and visual encoders. The design of the soft prompts follows CoOp [3] and VPT [7], which place prompts in the embedding space of the text and image encoders.
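
A condensed sketch of this idea is below; the prompt lengths, dimensions, and splitting scheme are illustrative simplifications of UPT.

```python
import torch
import torch.nn as nn

class UnifiedPrompt(nn.Module):
    """UPT-style sketch: one shared learnable prompt, transformed by lightweight self-attention,
    then split into a text-side prompt and a visual-side prompt."""
    def __init__(self, n_text=8, n_vis=8, dim=512, n_heads=8):
        super().__init__()
        self.n_text = n_text
        self.base = nn.Parameter(torch.randn(1, n_text + n_vis, dim) * 0.02)   # initial unified prompt
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self):
        u, _ = self.attn(self.base, self.base, self.base)     # jointly mix the two modalities' prompts
        u = self.proj(u)
        text_prompt = u[:, :self.n_text]                      # goes into the text encoder's embedding space
        visual_prompt = u[:, self.n_text:]                    # goes into the image encoder's embedding space
        return text_prompt, visual_prompt
```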

In-context learning for vision models

Language models can learn new tasks through few-shot prompting, where the prompt contains a few exemplars. Inspired by the same concept, [9] introduces a new type of visual prompting in which, given an example of a new task and a new image, the model produces the desired output image via inpainting. For example, if the examples demonstrate a masking task, the pretrained model can produce a mask for a new image at test time.

Each task is defined by constructing a grid-like image that contains an input-output example and a novel query; this grid is the visual prompt. The model used for this task is an inpainting model that combines an MAE with a VQGAN. During training, an input image is patchified, masked, and fed into the MAE; for each masked token, the decoder outputs a distribution over a pretrained VQGAN codebook. The model is trained with a cross-entropy loss on random crops from the training dataset.
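
Constructing the visual prompt itself is just image stitching. The sketch below builds the 2x2 grid (example input, example output, query, blank region to be inpainted), with the array shapes as illustrative assumptions.

```python
import numpy as np

def make_grid_prompt(example_in, example_out, query):
    """Build a grid-style visual prompt: [example input | example output] on top,
    [query | blank] below. The blank bottom-right cell is what the inpainting model fills in.
    All inputs are assumed to be HxWx3 uint8 arrays of the same size."""
    blank = np.zeros_like(query)                              # region the model will inpaint
    top = np.concatenate([example_in, example_out], axis=1)
    bottom = np.concatenate([query, blank], axis=1)
    return np.concatenate([top, bottom], axis=0)              # (2H, 2W, 3) visual prompt
```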

Some interesting applications

This section deserves a post of its own, but I am providing a summary here so that you are aware of the scenarios in which these visual prompts can be used.

Flamingo

This family of vision-language models [10] bridges pretrained vision-only and language-only models; the resulting models can handle interleaved sequences of visual and textual data and can ingest images or videos as input.

Otter

Instruction tuning, as used in models such as InstructGPT and FLAN, improves the zero-shot performance of language models. Otter [11] is a multi-modal model that instruction-tunes the Flamingo model described above, making multi-modal models instruction-following and improving their in-context learning.

Segment Anything

Segment Anything [12] builds a promptable foundation model for segmentation, making it zero-shot transferable to new image distributions. The promptable segmentation task is to return a valid segmentation mask given any prompt; flexible prompts such as points, bounding boxes, masks, and text can be used.
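
For instance, prompting SAM with a single foreground point looks roughly like this with the official segment-anything package; the checkpoint path, image, and point coordinates are placeholders.

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (model type and path are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# The predictor expects an RGB HxWx3 uint8 image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single point prompt: (x, y) coordinates with label 1 = foreground.
point = np.array([[500, 375]])
label = np.array([1])
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label, multimask_output=True)
best_mask = masks[scores.argmax()]            # pick the highest-scoring candidate mask
```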

Visual Prompting by Landing AI

Landing AI launched a product called Visual Prompting [13] that can build deployable computer vision models in a few hours by labelling a few areas of the target object, even with small datasets. This is similar in spirit to the in-context learning for visual prompting, like the inpainting task we discussed previously.

References

  • A. Radford, J. W. Kim, C. Hallacy et al., “Learning Transferable Visual Models From Natural Language Supervision,” arXiv preprint arXiv:2103.00020, 2021.
  • K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to Prompt for Vision-Language Models,” arXiv preprint arXiv:2109.01134, 2021.
  • K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional Prompt Learning for Vision-Language Models,” arXiv preprint arXiv:2203.05557, 2022.
  • G. F. Elsayed, I. J. Goodfellow, and J. Sohl-Dickstein, “Adversarial reprogramming of neural networks,” arXiv preprint arXiv:1806.11146, 2018.
  • H. Bahng, A. Jahanian, S. Sankaranarayanan, and P. Isola, “Exploring Visual Prompts for Adapting Large-Scale Models,” arXiv preprint arXiv:2203.17274, 2022.
  • M. Jia, L. Tang, B-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S-N. Lim, “Visual Prompt Tuning,” arXiv preprint arXiv:2203.12119, 2022.
  • Y. Zang, W. Li, K. Zhou et al., “Unified Vision and Language Prompt Learning,” arXiv preprint arXiv:2210.07225, 2022.
  • A. Bar, Y. Gandelsman, T. Darrell et al., “Visual prompting via image inpainting,” arXiv preprint arXiv:2209.00647, 2022.
  • J-B. Alayrac, J. Donahue, P. Luc et al., “Flamingo: a Visual Language Model for Few-Shot Learning,” arXiv preprint arXiv:2204.14198, 2022.
  • B. Li, Y. Zhang, L. Chen et al., “Otter: A Multi-Modal Model with In-Context Instruction Tuning,” arXiv preprint arXiv:2305.03726, 2023.
  • A. Kirillov, E. Mintun, N. Ravi et al., “Segment Anything,” arXiv preprint arXiv:2304.02643, 2023.
  • H. Bahng and P. Isola, “Visual Prompting,” CVPR 2023 Tutorial on Prompting in Vision, https://prompting-in-vision.github.io/slides/cvpr23/lecture-3.pdf