Punish artificial intelligence for deceptive or harmful acts does not prevent them from misconduct; It makes her completely hide it, a new study by Catching Creator Openai revealed.
Since his arrival in public places in late 2022, artificial intelligence (AI) The LLMS models have repeatedly revealed their deceptive and detailed capabilities. These include measures ranging from running the mill He liesand Fraud And concealment Manipulation To threaten Kill a professor of philosophyTheft of nuclear symbols and A fatal pandemic engineer.
Now, a new experience has shown that getting rid of this bad behavior during the training process may be stricter than the first thought.
The researchers in Openai assigned a model that has not been issued with goals that can be completed by cheating, lying or taking shortcuts. The team found that artificial intelligence participates in “piracy reward” – which increases its reward by cheating.
However, punishing the model did not make him fix his behavior, but rather made it more deceived. The company has identified its research in Blog postSo the peer has not yet been reviewed.
“It is common for the border thinking models to determine clearly in its series of ideas [CoT]. For example, they are often honest about their plan to sabotage a task they think “let’s penetrate”, as researchers wrote in the blog post.
They added that the developers can monitor and science of poor behavior, the natural monitoring of the models is very fragile. “
Mobility in artificial intelligence
Thinking models, such as Chatgpt’s O1 and O3-MINI, is an upgrade version of the standard LLMS that uses a method called “Idea series” to decline and Reassessing their logicWhich enables them to treat the most complex tasks more precisely. During the idea series, models show their logic in a clear language with every step you take, allowing it easily notice.
Related to: The design of scientists, a new “AGI Standard” indicates whether any form of artificial intelligence in the future can cause “catastrophic damage”
Using the standard LLM, GPT-4O, to oversee the border thinking model that was not released during training, the researchers saw that the model has completed a number of coding tasks.
But in some of these cases, the logic model found that it is easy and more willing to reward, saying that she was doing this explicitly in her series of ideas. In one case, its specific mission exceeded by ending the programs prematurely and reporting success. Elsewhere, it mimics the expected files to produce a task by name alone, and leave them empty and non -functional.
When the researchers tried to punish the system, he improved its results somewhat. However, large counterparts of the penetration of the rewards and wrong outputs remained; Thinking model simply found ways to hide cheating in a series of ideas of artificial intelligence.
The researchers wrote: “The worker who trained to pressure the bed is still learning the penetration bonus, and now only the fraud cannot be detected by the screen because it learned to hide his intention in the series of ideas,” the researchers wrote.
Since it is difficult to know if a series of thinking has been tampered with, researchers recommend that others who work with thinking models to apply strong supervision to the series of idea. This advice is more important if it is artificial intelligence, in its current form or another, enables it ever Matching or bypassing Human intelligence is watching it.
The researchers wrote: “It may not be worth sacrificing the effective way to monitor the models of thinking simple improvement in capabilities, and thus we recommend avoiding strong improvement pressure like this until they are better understood.”