Posts by Tags

ChatGPT

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.
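As a rough illustration of the reward-modelling step mentioned above, here is a minimal sketch of the pairwise preference loss commonly used in RLHF. It is not any paper’s actual implementation; the `reward_model` callable and its signature are assumptions made for this example.

```python
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry style loss: the score of the human-preferred response
    should exceed the score of the rejected one.
    `reward_model` is a hypothetical callable returning a scalar tensor."""
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(r_chosen - r_rejected); minimized when chosen >> rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```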

Claude

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

Constitutional AI

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

Humpback

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

InstructGPT

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

Llama2

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

RLHF

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

adversarial attacks

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

chain of thoughts

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.
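For concreteness, the target behaviour is trivial to write down in code, and a few-shot prompt for an LLM might look like the sketch below; the prompt wording is illustrative rather than the exact one used in the post.

```python
def last_letter_concat(word1: str, word2: str) -> str:
    """Ground-truth behaviour: 'Elon', 'Musk' -> 'nk'."""
    return word1[-1] + word2[-1]

# An illustrative few-shot prompt one might send to an LLM such as GPT-3.
few_shot_prompt = """\
Q: Take the last letters of 'Bill' and 'Gates' and concatenate them.
A: ls
Q: Take the last letters of 'Elon' and 'Musk' and concatenate them.
A:"""

assert last_letter_concat("Elon", "Musk") == "nk"
```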

instruction tuning

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

large language models

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

least-to-most prompting

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.

ml safety

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

prompting

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.

reasoning

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.

reasoning tasks

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

self-alignment

Improving instruction following capabilities using self-alignment

4 minute read

Published:

The introduction of GPT-3 revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle with zero-shot performance on tasks dissimilar from their pretraining data.

self-consistency

Reasoning in Large Language Models

5 minute read

Published:


Let’s start this blog with a task: we want to train a model that concatenates the last letters of two input words. For example, given the words ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning, we will need many training examples covering words with different final letters before the model gives the correct output. One might argue that few-shot learning with LLMs like GPT-3 can solve this problem, yet the model still fails to produce the right output.

universal triggers

An adversarial lens towards aligned large language models

7 minute read

Published:


Since the public release of LLM-based chat assistants like ChatGPT, there has been a strong emphasis on aligning language models to prevent the production of undesirable or harmful content. One approach is reinforcement learning from human feedback (RLHF), which optimizes a pre-trained language model against a reward function learned from human preferences [1]. Constitutional AI [2] goes further and removes the need for human preference labels by training the reward model on AI feedback refined with a set of safety instructions. The recently released Llama-2 model [3] also applies safety and helpfulness criteria in an RLHF-style pipeline, improving alignment in open-source LLMs.

vision language models

Visual Prompting

8 minute read

Published:

Large language models like GPT-3 can be prompted with in-context examples or instructions to complete tasks without fine-tuning the model’s parameters. Prompting handles open-ended queries without introducing large numbers of learnable parameters. However, manually crafting a successful discrete (hard) prompt that maximizes the likelihood of the desired output is challenging, and specific downstream tasks may still require domain adaptation. This motivates soft prompts: tunable vectors appended to the input that steer the model toward the desired outputs. Soft prompts help in low-data domains and improve generalization without exhaustive prompt engineering.
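As a minimal sketch of the soft-prompt idea described above (illustrative names and shapes, not a specific library’s API), the tunable vectors can be implemented as a small learnable matrix prepended to the frozen model’s input embeddings:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepends a small set of tunable vectors to the input embeddings of a
    frozen language model. Only these vectors are trained."""

    def __init__(self, num_prompt_tokens: int, embed_dim: int):
        super().__init__()
        # The only trainable parameters: num_prompt_tokens embedding vectors.
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen model's
        # embedding layer; prepend the learned prompt vectors to every example.
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```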

visual prompting

Visual Prompting

8 minute read

Published:

Large language models like GPT-3 can be prompted with in-context examples or instructions to complete tasks without fine-tuning the model’s parameters. Prompting handles open-ended queries without introducing large numbers of learnable parameters. However, manually crafting a successful discrete (hard) prompt that maximizes the likelihood of the desired output is challenging, and specific downstream tasks may still require domain adaptation. This motivates soft prompts: tunable vectors appended to the input that steer the model toward the desired outputs. Soft prompts help in low-data domains and improve generalization without exhaustive prompt engineering.