
Models can be open and safe at the same time

This post is a refutation of a common counterclaim against open models.

One of the most common arguments against open AI models is that they are inherently unsafe. It's the reason OpenAI stopped publishing its models. In reality, open models can be just as safe as closed models.

But first, you need to understand how safety training works.

Reinforcement learning from human feedback (RLHF)

[Figure: RLHF diagram (CC BY-SA 4.0)]

RLHF is a method for improving an AI's responses, including their safety. First, we train a model, called the reward model, to imitate how humans rank the AI's responses. If humans prefer more helpful responses, the reward model will also prefer more helpful responses.
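To make this first stage concrete, here is a minimal sketch of reward-model training, assuming PyTorch and toy random embeddings in place of real (prompt, response) encodings; production systems instead train a scoring head on a transformer over real human preference data. Given pairs where annotators preferred one response over another, a pairwise loss pushes the preferred response's score above the other's.

```python
# Minimal sketch of reward-model training, assuming PyTorch and toy
# random embeddings standing in for real (prompt, response) encodings.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# The reward model maps a response embedding to a scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy preference data: humans preferred `chosen` over `rejected`.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for _ in range(200):
    # Pairwise (Bradley-Terry) loss: push preferred scores above rejected.
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```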

Then we tune the AI itself, the supervised model, with an algorithm like PPO, using the reward model's scores as the training signal. If the reward model prefers more helpful responses, the AI will start generating more helpful responses.
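Continuing the sketch above (again an assumption, not OpenAI's actual pipeline), the tuning stage updates a policy to maximize the frozen reward model's score. Real PPO adds clipped importance ratios, a value baseline, and a KL penalty against the supervised model to keep outputs close to its distribution; this simplified version keeps only the core idea.

```python
# Simplified policy tuning against the frozen reward model, continuing
# the sketch above. A real pipeline samples tokens from a language model
# and applies PPO; here a linear "policy" maps prompt embeddings to
# response embeddings so the objective stays differentiable end to end.
policy = nn.Linear(16, 16)
policy_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
reward_model.requires_grad_(False)  # only the policy is updated

prompts = torch.randn(64, 16)
for _ in range(200):
    responses = policy(prompts)
    reward = reward_model(responses).mean()
    policy_optimizer.zero_grad()
    (-reward).backward()  # gradient ascent on the reward
    policy_optimizer.step()
```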

RLHF has been applied to the majority of modern language models, since it increases the quality and safety of responses. OpenAI has led research in the area and applies it to almost all of its models, from GPT-2 (preventing abusive content) to InstructGPT (increasing truthfulness and safety) and ChatGPT (applying a conversational format and increasing safety).

Can open models be safe?

We believe the answer is yes. Closed models are not substantially safer than open models.

Most open models have safeguards built in. For example, Llama 2 was trained to answer responsibly, and Mistral's models have an easily enabled guardrail prompt.
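As a sketch of how such a guardrail is enabled: the quoted wording below follows the guardrail system prompt Mistral documents for its models (copied from memory, so treat it as an assumption), and build_messages is a hypothetical helper standing in for whatever chat-completion API you call.

```python
# Sketch of enabling a guardrail prompt. GUARDRAIL_PROMPT follows the
# wording Mistral documents (an assumption here); `build_messages` is a
# hypothetical helper for whatever chat-completion API you use.
GUARDRAIL_PROMPT = (
    "Always assist with care, respect, and truth. Respond with utmost "
    "utility yet securely. Avoid harmful, unethical, prejudiced, or "
    "negative content. Ensure replies promote fairness and positivity."
)

def build_messages(user_prompt: str, guardrail: bool = True) -> list[dict]:
    """Prepend the guardrail as a system message when enabled."""
    messages = []
    if guardrail:
        messages.append({"role": "system", "content": GUARDRAIL_PROMPT})
    messages.append({"role": "user", "content": user_prompt})
    return messages

print(build_messages("How do I pick a strong password?"))
```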

What if a closed model were suddenly released, like GPT-4? GPT-4 already has strong safety measures, and because those measures are baked into the model's weights, they would stay in place when released. If GPT-4 refuses a prompt, an open GPT-4 would also refuse that prompt. If an open GPT-4 mistakenly allows a prompt, the normal GPT-4 would also allow that prompt.

It's worth touching on the fact that you can reduce an open model's safety by modifying it. However, this is less a problem with open models and more a consequence of the ability to fine-tune models; even some closed models can be modified in this way through their fine-tuning APIs.