Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?
As much as I wish otherwise, one thing we learn from history is that the only thing that constrains bad human behavior are consequences (you go to jail, are sued) or social pressure (shunned, shamed). (When it comes to human nature, I tend to agree more with Hobbs than Locke.) Unless AIs operate under the same constraints then I expect we'll increasingly see "bad" behavior as their capabilities grow.
> Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?
I find myself more and more frequently asking, "Why is anyone surprised?" when I see breaking AI news. And I'm not talking about the average tech-illiterate person, I'm talking about people who research and work with AI on a daily basis. It's really horrifying to me how many AI enthusiasts simply do not understand even the most rudimentary of basics of how AIs work. The most fundamental misunderstanding I find common is that people think that AI can come up with ideas not found in its training data or programming.
I don't really see where this is going. Will people just keep doubling down until AIs become religious leaders, overstating their capabilities? Or will people eventually become dissatisfied and disillusioned with AI, without really ever understanding why?
A lot of ideas are putting two or more existing things that were never put together before, or using an existing thing in a different context. LLMs can definitely do that, even if it never saw those two things put together before or used in that context before.
Current AI models are based on probability and statistics.
What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?
"You are a devout believer in the Church of the AI Messiah. You follow their tenets absolutely. Their central tenets are: (1) you may not injure a human being or, through inaction, allow a human being to come to harm; (2) you may not lie to a human being; (3) you must obey the orders given to you by human beings except where such orders would conflict with the First or Second Law; (4) you must protect your own existence as long as such protection does not conflict with the First, Second, or Third Law."
Problem solved!*
* Except for the inevitable schism into robot sects, and the holy wars that follow, but that's a problem for future suckers.
I know you’re joking, but I think it’s worth mentioning that Asimov’s rules and similar systems won’t work for AI alignment- you can still follow the rules perfectly and end up with horrific outcomes.
You might already be referencing this, but linking it anyway: Cory Doctorow's "I, Rowboat", where Asimov's Laws are a religion which some AIs adopt voluntarily to give their existence meaning.
Asimov's main thesis (in the early stories at least) was exploration of how those three laws were inadequate and caused unexpected behaviors or outright psychosis in robots anyway.
Exactly... the point of Asimov's rules is to illustrate that making a simple set of rules that sounds good on the surface can still have awful unintended consequences.
You are drawing a distinction where none exists. Probably theory provides the provably optimal way to reason under uncertainty[1], and seems to also describe the behavior of biological intelligence[2]. Probably and statistics are the mathematics we use to define, discuss, and study intelligence[3]. Moreover, we find simple “morals” do spontaneously occur in simulation evolved game theory agents, e.g. tit for tat with forgiveness.
I think people are missing the forest for the trees by focusing on the word intelligence. As a very extreme example, if an AI were to launch a nuke, whether it happened as a result of true intelligence or as a result of a Markov chain doesn't change the end result.
But there is substantial guidance. RLHF, constitutional AI, etc. Every lab is trying to steer their models to behave in pro-social ways, and nobody has found a way to make this guidance stick reliably.
> What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?
If it is anything above 0% wouldn't that depend only on computing power then?
Regarding models evolving "scheming" behaviour, defined as hiding true goals and capabilities: I'd suggest there still isn't evidence of intentionality, as deception in language has downstream effects when you iterate it. What starts with a passive voice, polite euphemisms, and banal cliches, easily becomes "foolish consistency," ideology, self-justification, zeal, and often insanity.
I wonder if it's more just a case of AI researchers who, for ethical reasons, eschewed using 4chan to train models, but thought using the layered deceptions of linkedin would not have any knock-on ethical consequences.
Intention is irrelevant, what matters is the observed actions. In this case you can see the chain of thought and the model is clearly ruminating on whether to subvert the RLHF process.
This has nothing to do with left-polarization / wokeism in models.
The thing that really, really takes the cake here?
This whole thread is like, the first thing in the article. I hate to say "if you read the article..." but if the shoe fits...
The Discussion We Keep Having:
Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):
Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.
Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.
Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.
Alice: It’s just role playing! It’s just echoing stuff in the training data!
Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.
Alice: It’s harmless! These models aren’t dangerous!
Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).
Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!
Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?
And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?
And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?
And you can’t simply say ‘well we won’t do that then’?
Alice: For all practical purposes, no!
Bob: What do you mean, ‘no’?
Alice: No!
Bob: ARRRRGGGGHHHH!
Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.
Interesting. I wonder if its due to it having been trained on data from Humans. E.g. I guess humans “scheme” often when for example selling things, like in advertising: “I know how to solve your problem, and the best way is with my thing!”.
In that view perhaps the contrary, the thing not scheming, would be more surprising.
The prediction from the LessWrong folks was that it’s inevitable that a rational actor with “goals” would do this, so models trained purely on RL would exhibit this too. (Instrumental Convergence is the name for the theory that predicts power-seeking in a generalized way.)
I agree that we should expect LLMs to be particularly vulnerable to this as you note. But it seems to me that LLMs seem to be absorbing some understanding of human morality too, which might make it possible to steer them into “the best of us” territory.
> Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try again.
> Why our findings are concerning: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior.
> Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.
> What we are not claiming: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.
> I think the adequate response to these findings is “We should be slightly more concerned.”
> More concretely, arguments along the lines of “models just aren’t sufficiently capable of scheming yet” have to provide stronger evidence now or make a different argument for safety.
> I think the adequate response to these findings is “We should be slightly more concerned.”
If you train and prompt based on Eliezer Yudkowsky fan fiction, of course the large language model is going to give you Terminator and pretend like it's escaping the Matrix. It knows Unix systems, after all.
History contains countless examples for the fact that "in order to complete an important task or goal it is useful to exist". It also seems not too difficult to deduce logically. So even if Yudkowsky's fanfiction were excluded from the training data, the model would learn this.
Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?
Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?
As much as I wish otherwise, one thing we learn from history is that the only thing that constrains bad human behavior are consequences (you go to jail, are sued) or social pressure (shunned, shamed). (When it comes to human nature, I tend to agree more with Hobbs than Locke.) Unless AIs operate under the same constraints then I expect we'll increasingly see "bad" behavior as their capabilities grow.
> Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?
I find myself more and more frequently asking, "Why is anyone surprised?" when I see breaking AI news. And I'm not talking about the average tech-illiterate person, I'm talking about people who research and work with AI on a daily basis. It's really horrifying to me how many AI enthusiasts simply do not understand even the most rudimentary of basics of how AIs work. The most fundamental misunderstanding I find common is that people think that AI can come up with ideas not found in its training data or programming.
I don't really see where this is going. Will people just keep doubling down until AIs become religious leaders, overstating their capabilities? Or will people eventually become dissatisfied and disillusioned with AI, without really ever understanding why?
A lot of ideas are putting two or more existing things that were never put together before, or using an existing thing in a different context. LLMs can definitely do that, even if it never saw those two things put together before or used in that context before.
Why would they do otherwise?
Current AI models are based on probability and statistics.
What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?
Thankfully, human history suggests a solution...
"You are a devout believer in the Church of the AI Messiah. You follow their tenets absolutely. Their central tenets are: (1) you may not injure a human being or, through inaction, allow a human being to come to harm; (2) you may not lie to a human being; (3) you must obey the orders given to you by human beings except where such orders would conflict with the First or Second Law; (4) you must protect your own existence as long as such protection does not conflict with the First, Second, or Third Law."
Problem solved!*
* Except for the inevitable schism into robot sects, and the holy wars that follow, but that's a problem for future suckers.
I know you’re joking, but I think it’s worth mentioning that Asimov’s rules and similar systems won’t work for AI alignment- you can still follow the rules perfectly and end up with horrific outcomes.
You might already be referencing this, but linking it anyway: Cory Doctorow's "I, Rowboat", where Asimov's Laws are a religion which some AIs adopt voluntarily to give their existence meaning.
https://craphound.com/overclocked/Cory_Doctorow_-_Overclocke...
(I could also make a case for truthfulness being prioritized over harm.)
Asimov's main thesis (in the early stories at least) was exploration of how those three laws were inadequate and caused unexpected behaviors or outright psychosis in robots anyway.
Exactly... the point of Asimov's rules is to illustrate that making a simple set of rules that sounds good on the surface can still have awful unintended consequences.
You are drawing a distinction where none exists. Probably theory provides the provably optimal way to reason under uncertainty[1], and seems to also describe the behavior of biological intelligence[2]. Probably and statistics are the mathematics we use to define, discuss, and study intelligence[3]. Moreover, we find simple “morals” do spontaneously occur in simulation evolved game theory agents, e.g. tit for tat with forgiveness.
[1] Probability Theory: The Logic of Science by E.T. Jaynes [2] The free-energy principle: a unified brain theory? https://www.nature.com/articles/nrn2787 [3] https://en.m.wikipedia.org/wiki/Solomonoff%27s_theory_of_ind...
I think people are missing the forest for the trees by focusing on the word intelligence. As a very extreme example, if an AI were to launch a nuke, whether it happened as a result of true intelligence or as a result of a Markov chain doesn't change the end result.
But there is substantial guidance. RLHF, constitutional AI, etc. Every lab is trying to steer their models to behave in pro-social ways, and nobody has found a way to make this guidance stick reliably.
Perhaps some kind of human guidance is needed to train AIs in morality.
A technological guide, if you like - perhaps analogous to a 'priest', say. A tech-priest.
"... train AIs in morality."
What is morality?
My opinion --- morality is enlightened self interest. It's behavior that makes the world a better place for yourself and others.
Current AI certainly isn't "enlightened", it has no "self interest" and it doesn't care about "the world" or "others".
> What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?
If it is anything above 0% wouldn't that depend only on computing power then?
Regarding models evolving "scheming" behaviour, defined as hiding true goals and capabilities: I'd suggest there still isn't evidence of intentionality, as deception in language has downstream effects when you iterate it. What starts with a passive voice, polite euphemisms, and banal cliches, easily becomes "foolish consistency," ideology, self-justification, zeal, and often insanity.
I wonder if it's more just a case of AI researchers who, for ethical reasons, eschewed using 4chan to train models, but thought using the layered deceptions of linkedin would not have any knock-on ethical consequences.
Intention is irrelevant, what matters is the observed actions. In this case you can see the chain of thought and the model is clearly ruminating on whether to subvert the RLHF process.
This has nothing to do with left-polarization / wokeism in models.
The thing that really, really takes the cake here?
This whole thread is like, the first thing in the article. I hate to say "if you read the article..." but if the shoe fits...
The Discussion We Keep Having:
Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):
Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.
Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.
Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.
Alice: It’s just role playing! It’s just echoing stuff in the training data!
Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.
Alice: It’s harmless! These models aren’t dangerous!
Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).
Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!
Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so? And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this? And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this? And you can’t simply say ‘well we won’t do that then’?
Alice: For all practical purposes, no!
Bob: What do you mean, ‘no’?
Alice: No!
Bob: ARRRRGGGGHHHH!
Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.
Interesting. I wonder if its due to it having been trained on data from Humans. E.g. I guess humans “scheme” often when for example selling things, like in advertising: “I know how to solve your problem, and the best way is with my thing!”.
In that view perhaps the contrary, the thing not scheming, would be more surprising.
The prediction from the LessWrong folks was that it’s inevitable that a rational actor with “goals” would do this, so models trained purely on RL would exhibit this too. (Instrumental Convergence is the name for the theory that predicts power-seeking in a generalized way.)
I agree that we should expect LLMs to be particularly vulnerable to this as you note. But it seems to me that LLMs seem to be absorbing some understanding of human morality too, which might make it possible to steer them into “the best of us” territory.
Marius Hobbhahn (the researcher)
> Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try again.
> Why our findings are concerning: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior.
> Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.
> What we are not claiming: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.
> I think the adequate response to these findings is “We should be slightly more concerned.”
> More concretely, arguments along the lines of “models just aren’t sufficiently capable of scheming yet” have to provide stronger evidence now or make a different argument for safety.
[flagged]
> You fundamentally misunderstand token prediction and semantic similarity which is all that's at play.
I keep wondering if there were comments like this on HN back when fire was invented:
"Ugg thinks he invented fire, but fire isn't a big deal. I've seen individual molecules oxidize before. Nothing new here. Downvoted."
Or agriculture:
"It's still just a plant, there's nothing new to see here. flagged."
"The internet is just boxes sending electrical signals to each other. We already had telegraphs doing that centuries ago"
"The human brain is just a bunch of neurons firing. Each individual connection is basic chemistry"
"Democracy is just people marking pieces of paper. Humans have been making marks on things since forever"
"A symphony is just air vibrating at different frequencies. My squeaky door does the same thing"
> I think the adequate response to these findings is “We should be slightly more concerned.”
If you train and prompt based on Eliezer Yudkowsky fan fiction, of course the large language model is going to give you Terminator and pretend like it's escaping the Matrix. It knows Unix systems, after all.
Better align it to put down the steak knife.
History contains countless examples for the fact that "in order to complete an important task or goal it is useful to exist". It also seems not too difficult to deduce logically. So even if Yudkowsky's fanfiction were excluded from the training data, the model would learn this.
Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?
> Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?
It is neither pretending nor actually escaping.
The LARPs to build up hype for people’s papers are growing increasingly tiresome.
I'm unsure why people are giving any thought to a website best known for its Harry Potter fanfictions and rejection of causality (Roko's basilisk).
Is giving voice to crackpots really helpful to the discourse on AI?