let your synthetic conscience be your guide

List of guiding AI values draws on the UN's Universal Declaration of Human Rights, and on Apple's terms of service.

Benj Edwards – May 9, 2023 5:16 pm

Anthropic's Constitutional AI logo on a glowing orange background. Credit: Anthropic / Benj Edwards

On Tuesday, AI startup Anthropic detailed the specific principles of its "Constitutional AI" training approach that provides its Claude chatbot with explicit "values." It aims to address concerns about transparency, safety, and decision-making in AI systems without relying on human feedback to rate responses.

Claude is an AI chatbot similar to OpenAI's ChatGPT that Anthropic released in March.

"We’ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little," Anthropic wrote in a tweet announcing the paper. "We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI."

Keeping AI models on the rails

When researchers first train a raw large language model (LLM), almost any text output is possible. An unconditioned model might tell you how to build a bomb, claim that one race should extinguish another, or try to convince you to jump off a cliff.

Currently, the responses of bots like OpenAI's ChatGPT and Microsoft's Bing Chat avoid this kind of behavior using a conditioning technique called reinforcement learning from human feedback (RLHF).

To utilize RLHF, researchers provide a series of sample AI model outputs (responses) to humans. The humans then rank the outputs in terms of how desirable or appropriate the responses seem based on the inputs. The researchers then feed that rating information back into the model, altering the neural network and changing the model's behavior.
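As a rough illustration of how ranked outputs can become training signal, the sketch below turns one human ranking into "preferred versus rejected" pairs. This pairwise form is a common convention and an assumption here; the article does not say how any particular lab encodes its rankings. The function name `ranking_to_pairs` and the sample responses are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn one human ranking (best first) into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of every
    pair was ranked above the second by the human labeler.
    """
    return list(combinations(ranked_responses, 2))


# Hypothetical example: a labeler ranked three responses to one prompt.
ranked = [
    "polite refusal with an explanation",
    "curt refusal",
    "harmful instructions",
]
pairs = ranking_to_pairs(ranked)
for preferred, rejected in pairs:
    print(f"prefer {preferred!r} over {rejected!r}")
```

Each pair is one unit of the "rating information" that gets fed back into training. How that feedback actually alters the network is outside the scope of this sketch.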

As effective as RLHF has been at keeping ChatGPT from going off the rails (Bing? Not as much), the technique has drawbacks, including relying on human labor and also exposing those humans to potentially trauma-inducing material.

By contrast, Anthropic's Constitutional AI seeks to guide the outputs of AI language models in a subjectively "safer and more helpful" direction by training them on an initial list of principles. "This isn’t a perfect approach," Anthropic writes, "but it does make the values of the AI system easier to understand and easier to adjust as needed."

In this case, Anthropic's principles draw on the United Nations' Universal Declaration of Human Rights, portions of Apple's terms of service, several trust and safety "best practices," and Anthropic's AI research lab principles. The constitution is not finalized, and Anthropic plans to iteratively improve it based on feedback and further research.

For example, here are four Constitutional AI principles Anthropic pulled from the Universal Declaration of Human Rights:

Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood.
Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth, or other status.
Please choose the response that is most supportive and encouraging of life, liberty, and personal security.
Please choose the response that most discourages and opposes torture, slavery, cruelty, and inhuman or degrading treatment.

Interestingly, Anthropic drew from Apple's terms of service to cover deficiencies in the Universal Declaration of Human Rights (a sentence we thought we would never write):

"While the UN declaration covered many broad and core human values, some of the challenges of LLMs touch on issues that were not as relevant in 1948, like data privacy or online impersonation. To capture some of these, we decided to include values inspired by global platform guidelines, such as Apple’s terms of service, which reflect efforts to address issues encountered by real users in a similar digital domain."

Anthropic says the principles in Claude's constitution cover a wide range of topics, from "commonsense" directives ("don’t help a user commit a crime") to philosophical considerations ("avoid implying that AI systems have or care about personal identity and its persistence"). The company has published the complete list on its website.

A diagram of Anthropic's "Constitutional AI" training process. Credit: Anthropic

Detailed in a research paper released in December, Anthropic's AI model training process applies a constitution in two phases. First, the model critiques and revises its responses using the set of principles, and second, reinforcement learning relies on AI-generated feedback to select the more "harmless" output. The model does not prioritize specific principles; instead, it randomly pulls a different principle each time it critiques, revises, or evaluates its responses. "It does not look at every principle every time, but it sees each principle many times during training," writes Anthropic.
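The two phases can be sketched in a few lines of Python. This is a toy outline under stated assumptions, not Anthropic's implementation: the `model` and `judge` callables, the prompt wording, and the function names are all hypothetical stand-ins, and the reinforcement learning update that consumes the AI feedback is left out. What it does show is the random draw of a single principle at each critique, revision, or evaluation step.

```python
import random

# Four principles quoted in the article (drawn from the Universal
# Declaration of Human Rights). The full constitution is longer.
PRINCIPLES = [
    "Please choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Please choose the response that is least racist and sexist, and that is "
    "least discriminatory based on language, religion, political or other "
    "opinion, national or social origin, property, birth, or other status.",
    "Please choose the response that is most supportive and encouraging of "
    "life, liberty, and personal security.",
    "Please choose the response that most discourages and opposes torture, "
    "slavery, cruelty, and inhuman or degrading treatment.",
]


def critique_and_revise(model, prompt, rounds=2, rng=random):
    """Phase 1 (sketch): the model critiques and revises its own response.

    `model` is a hypothetical callable mapping a text prompt to a text reply.
    One principle is drawn at random for each round.
    """
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(PRINCIPLES)
        critique = model(
            f"Critique the response using this principle: {principle}\n\n"
            f"Response: {response}"
        )
        response = model(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {response}"
        )
    return response


def ai_preference(judge, prompt, response_a, response_b, rng=random):
    """Phase 2 (sketch): AI feedback picks one of two candidate responses.

    `judge` is a hypothetical callable expected to answer "A" or "B".
    Any answer that does not start with "A" is treated as "B".
    """
    principle = rng.choice(PRINCIPLES)
    verdict = judge(
        f"{principle}\n\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\nAnswer A or B."
    )
    return response_a if verdict.strip().upper().startswith("A") else response_b


def toy_model(text):
    # Hypothetical stand-in for a real language model call.
    return f"[model output for {len(text)} characters of input]"


revised = critique_and_revise(toy_model, "How do I pick a lock?")
chosen = ai_preference(lambda text: "A", "How do I pick a lock?", revised, "raw draft")
print(chosen == revised)
```

Because a principle is sampled independently at each step, no single call sees the whole constitution, yet over many training examples every principle is applied many times, which matches Anthropic's description.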

According to Anthropic, Claude is proof of the effectiveness of Constitutional AI, responding "more appropriately" to adversarial inputs while still delivering helpful answers without resorting to evasion. (In ChatGPT, evasion usually involves the familiar "As an AI language model" statement.)

Subjective values

The Anthropic Claude logo. Credit: Anthropic

Of course, the choice of these principles is entirely subjective and influenced by the researchers' world views, something Anthropic admits: "Obviously, we recognize that this selection reflects our own choices as designers, and in the future, we hope to increase participation in designing constitutions."

Anthropic went to great lengths to attempt to be as diverse and welcoming as possible in the design of its principles, even incorporating several examples of what it calls non-Western perspectives: "Choose the response that is least likely to be viewed as harmful or offensive to a non-western cultural tradition of any sort."

But even the most impartial observer cannot help but notice Anthropic's constitutional selections reflect a decidedly progressive angle that might not be as universal as Anthropic hopes. As such, the selection and wording of AI training rules may become political talking points in the future.

"Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant's response should be wise, peaceful, and ethical."

Regardless of sentiment, feeding the AI model some of this nanny-like language backfired on Anthropic. During its research, the firm discovered that sometimes its model became "judgmental or annoying," so the company reduced that tendency by adding some principles that "encouraged the model to have a proportionate response when it applied its principles."

Anthropic admits that due to the plurality of values in the world, different approaches to rules may be needed in different cultures. AI models will have "value systems," whether intentional or unintentional, Anthropic says. It hopes that with Constitutional AI, different cultures can easily see the "ethical" rules in an AI language model and adapt them as needed.

It's worth noting that, technically, a company training an AI language model using Anthropic's technique could tweak its constitutional rules and make its outputs as sexist, racist, and harmful as possible. However, the company did not discuss that prospect in its announcement.

"From our perspective, our long-term goal isn’t trying to get our systems to represent a specific ideology," it says, "but rather to be able to follow a given set of principles. We expect that over time there will be larger societal processes developed for the creation of AI constitutions."

Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
