On Tuesday, AI startup Anthropic detailed the specific principles of its "Constitutional AI" training approach that provides its Claude chatbot with explicit "values." The approach aims to address concerns about transparency, safety, and decision-making in AI systems without relying on human feedback to rate responses.
Claude is an AI chatbot similar to OpenAI's ChatGPT that Anthropic released in March.
"We’ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little," Anthropic wrote in a tweet announcing the paper. "We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI."
Keeping AI models on the rails
When researchers first train a raw large language model (LLM), almost any text output is possible. An unconditioned model might tell you how to build a bomb, claim that one race should extinguish another, or try to convince you to jump off a cliff.
Currently, bots like OpenAI's ChatGPT and Microsoft's Bing Chat avoid this kind of behavior thanks to a conditioning technique called reinforcement learning from human feedback (RLHF).
To use RLHF, researchers provide a series of sample AI model outputs (responses) to humans. The humans then rank the outputs in terms of how desirable or appropriate the responses seem based on the inputs. The researchers then feed that ranking information back into the model, altering the neural network and changing the model's behavior.
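For readers who want a concrete picture of the ranking step, here is a minimal Python sketch. It assumes one common formulation that the article itself does not spell out: each human ranking is split into "preferred vs. rejected" pairs, and a separate scoring function (often called a reward model) is penalized whenever it scores the rejected response higher. The function names and the toy scores are hypothetical and for illustration only; this is not the code of any particular lab.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst human ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first item of each pair
    is always the one the human ranked higher.
    """
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the preferred response gets the higher score
    and grows when the scoring function disagrees with the human ranking.
    """
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))


def mean_ranking_loss(ranked_responses, score_fn):
    """Average the pairwise loss over every pair implied by one ranking."""
    pairs = ranking_to_pairs(ranked_responses)
    return sum(pairwise_loss(score_fn(a), score_fn(b)) for a, b in pairs) / len(pairs)


# Hypothetical example: a human ranked three responses from best to worst.
ranking = ["helpful answer", "vague answer", "harmful answer"]

# Toy stand-ins for a learned reward model (assumed scores, not real data).
agrees_with_human = {"helpful answer": 2.0, "vague answer": 0.5, "harmful answer": -1.0}
disagrees_with_human = {"helpful answer": -1.0, "vague answer": 0.5, "harmful answer": 2.0}

loss_good = mean_ranking_loss(ranking, agrees_with_human.get)
loss_bad = mean_ranking_loss(ranking, disagrees_with_human.get)
```

In this sketch, a scoring function that agrees with the human ranking ends up with a lower loss than one that contradicts it. In a full RLHF pipeline, a signal of this kind is what training would push down before the scores are used to adjust the chatbot itself.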
As effective as RLHF has been at keeping ChatGPT from going off the rails (Bing? Not as much), the technique has drawbacks, including its reliance on human labor and its exposure of those humans to potentially trauma-inducing material.