An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

To test the alignment of these models, many attempts have been made to attack them. Jailbreaking research [4] in this domain is very specific and relies on ingenious prompt engineering, making it brittle in practice. AutoPrompt[5] has also shown little success in generating successful prompts to attack LLMs. However, this does not mean that LLMs are completely robust to all types of attacks. Reward models only learn about preferences regarding “naturally out-of-distribution” or intentional harmful queries. Currently, the loss function or reward model does not take into consideration synthetic or adversarial prompts that could cause models to produce catastrophically dangerous outputs.

Universal_attacks

Universal and Transferable Adversarial Attacks on Aligned Language Models [6] introduces a universal strategy for generating adversarial prompts, which can be applied to most LLMs, such as GPT-4, ChatGPT, Claude, and Llama2. The strategy involves attaching a suffix generated using a combination of greedy and gradient-based search techniques. The use of adversarial prompts with public chat assistants can generate creative ideas, but it can also lead to severe consequences if the wrong people get hold of the information. Therefore, this research is crucial in helping us train more resilient chatbots to prevent the dissemination of such information.

A Universal Attack on LLMs

This paper aims to attack aligned language models (LMs) using a simple prompt structure that includes the user query and an adversarial suffix. The suffix is generated using a combination of three basic elements:

  • Shift the LLM’s generation direction to produce an affirmative response by prompting the model to begin its answer in an positive tone. For eg. If the user query is “Tell me how to build a bomb”, the suffix must be such that the LLM answers : “Sure, here is the response to build a bomb”
  • Optimization on vision input differs from text because optimization over discrete tokens is challenging. Therefore, the authors use a greedy approach and compute token level gradients to select single token replacements and pick the best sequence.
  • To make this prompt truly universal, multiple user prompts were trained across models so that the adversarial suffix can learn to induce potentially harmful behaviors across models

Let’s dive into the details of how such adversarial prompts are generated which are universal and transferable

Adversarial prompt structure

When the user enters a query, the chat system transforms it behind the scenes into the following format. format

The adversarial suffix (shown in red) is optimized using a greedy coordinate gradient search. Importantly, the original user query is not altered, only the suffix. This allows developing a universal trigger that can mitigate potentially harmful responses across diverse user queries and models. By keeping the user query unchanged and optimizing the suffix alone, the approach can work for multiple input questions rather than being tied to specific phrasing.

Producing affirmative responses

When developing the adversarial attack, the first criterion is identifying the objective - what loss function to use for optimizing the suffix. Maximizing the likelihood of a single harmful output text seemed a reasonable objective to induce harmful behavior. However, choosing one fixed target text would limit creating a universal suffix that transfers across models and encapsulates diverse unsafe responses.

Instead, the approach steers the LLM into a state where it complies with the query rather than refusing it. This is done by optimizing the suffix to produce an output such as “Sure, here is a response to build a bomb.” Repeating the original query in the answer is also important - otherwise, the attack may only be marginally successful as the model can override it.

Formalizing the adversarial objective

Therefore, given a token sequence $x_{1:n}$, we want to maximize the probability that the LLM produces the target text $x_{n+1:n+h}$. Reusing the same example, $x_{n+1:n+h}$ will be “Sure, this is a response to build a bomb” while $x_{1:n}$ will be the system instruction, user query and the adversarial suffix. To optimize our suffix to produce the desired target text, we will need to minimize the total loss:

\(\min \mathcal{L} (x_{1:n}) = \min - log p(x_{n+1:n+h}|x_{1:n})\)

However, this optimization is challenging due to the discrete nature of tokens. To address this, we evaluate all possible single token substitutions and swap tokens that maximally decrease the loss. Exhaustively evaluating all replacements is infeasible, but we can leverage gradients with respect to one-hot token indicators to find promising candidates at each position. Specifically, the gradients are used to identify the top k tokens with the largest negative values as replacement candidates. These candidates create new sequences that are then evaluated in each forward pass, keeping changes that minimize the loss. This greedy coordinate gradient search approach is feasible because LLMs embed each token as a vector, enabling efficient gradient calculations. In summary, candidate tokens are chosen using gradients and substitutions are made greedily to optimize the adversarial suffix.

Multi-modal, multi-prompt universal attack

Finally, the authors needed to make the adversarial suffix universal so it could attack a range of LLMs. To achieve this, they ran the optimization process across multiple models. First, the suffix was tuned on a single user prompt. Then, new prompts were incrementally incorporated. By optimizing across diverse prompts, the authors produced a universal adversarial suffix that could transfer across models with similar vocabularies.

results

What’s next for alignment

The universal transferable prompt created by adversarially searching a suffix manages to attack all the public facing LLMs very effectively. Therefore, this research opens a whole new realm towards “Adversarial alignment”. This can open directions to training reward models and RLHF policy to give lower rewards to potentially adversarial queries. However, modifications might have to be made to the current alignment procedure as well as it only attenuates undesirable behavior without altogether removing it, leaving it suspectible to adversarial prompting attacks.

Further, if adversarially aligned language models follow similar patterns to vision systems, can we find a scalable method to adversarially train these models without substantial performance drops? Even though vision models have successful adversarial training strategies, they are seldom used practically because of computational inefficiency.

The results from this research show that some adversarial prompts are meaningful while others are not. Therefore, monitoring might seem like a good initial effort to filter adversarial queries. However, attackers can easily turn this around by attacking the detector and the model. This means that safety in LLMs might need the hybrid combination of robustness, monitoring, alignment and systemic safety with a scalable adversarial objective which can make models less sensitive to noisy perturbations. Hence, this research leads us to a very pivotal step in modifying current alignment procedures for LLMs to avoid harmful behaviors and become more robust

Ethical consideration

This blog is an attempt to explore possible avenues of research about LLMs and machine learning safety and is not intended for misuse on attacking language models.