Posts by Tags

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

Improving instruction following capabilities using self-alignment

4 minute read

Published: August 24, 2023

The introduction of GPT-3 completely revolutionized natural language processing by enabling few-shot learning through prompt engineering rather than fine-tuning. However, language models still struggle at zero-shot performance on tasks dissimilar from their pretraining data.

Reasoning in Large Language Models

5 minute read

Published: August 13, 2023

Let’s start this blog with a task. We have to train a model which concatenates the last letters of 2 input words. For example, if the input words are ‘Elon’ and ‘Musk’, the model should return ‘nk’. If we use supervised learning to train said model, we will need many examples with variation of words containing different end letters to create a model which gives the correct output. One might argue that we can use few shot learning with LLMs like GPT-3 to solve this problem. However, the model still isn’t able to produce the right output.

An adversarial lens towards aligned large language models

7 minute read

Published: August 06, 2023

Since the public release of LLM-based chat assistants like ChatGPT, there has been a large emphasis on aligning AI language models to prevent the production of undesirable or harmful content. One approach is to use reinforcement learning from human preferences to optimize a pre-trained language model by learning a reward function based on human preferences [1]. Constitutional AI [2] further removes the need for “human” preferences by training a reward model from AI feedback refined using safety instructions. The recently released Llama-2 model [3] also uses safety and helpfulness criteria to learn an RLHF-like model that improves alignment in open-source LLMs.

Visual Prompting

8 minute read

Published: September 05, 2023

Large language models like GPT-3 can be prompted with in-context examples or instructions to complete tasks without fine-tuning the model’s parameters. Prompting allows handling open-ended queries without introducing large numbers of learnable parameters. However, manually crafting a successful prompt to maximize the likelihood of the desired output is challenging (Hard prompts). For specific downstream tasks, domain adaptation may be required. This motivates soft prompts - appending tunable vectors to the input to steer the model toward desired outputs. Soft prompts help handle low-data domains and improve generalization without exhaustive prompt engineering.

Visual Prompting

8 minute read

Published: September 05, 2023

Large language models like GPT-3 can be prompted with in-context examples or instructions to complete tasks without fine-tuning the model’s parameters. Prompting allows handling open-ended queries without introducing large numbers of learnable parameters. However, manually crafting a successful prompt to maximize the likelihood of the desired output is challenging (Hard prompts). For specific downstream tasks, domain adaptation may be required. This motivates soft prompts - appending tunable vectors to the input to steer the model toward desired outputs. Soft prompts help handle low-data domains and improve generalization without exhaustive prompt engineering.

Aakanksha Sanctis

Posts by Tags

ChatGPT

Claude

Constitutional AI

Humpback

InstructGPT

Llama2

RLHF

adversarial attacks

chain of thoughts

instruction tuning

large language models

least-to-most prompting

ml safety

prompting

reasoning

reasoning tasks

self-alignment

self-consistency

universal triggers

vision language models

visual prompting