Do these models really lie or do they only do what they are supposed to do - produce text that is statistically similar to the training set, but not in the training set (and thus can include false/made up statements)?<p>Now they add another run on top of it that is in principle prone to the same issues, except they reward the model for factuality instead of likeability. This is cool, but why not apply the same reward strategy to the answer itself?
They don't really lie, they just produce text.<p>But the Eliza effect is amazingly powerful.<p>Applying the same reward strategy to the answer itself would be a more intellectually honest approach, but would rub our noses in the fact that LLMs don't have any access to "truth" and so at best we'd be conditioning them to be better at fooling us.
Do LLMs lie? Consider a situation in a screenplay where a character lies, compared to one where the character tells the truth. It seems likely that LLMs can distinguish these situations and generate appropriate text. Internally, they can represent “the current character is lying now” differently from “the current character is telling the truth.”<p>And earlier this year there was some interesting research published about how LLMs have an “evil vector” that, if enabled, gets them to act like stereotypical villains.<p>So it seems pretty clear that <i>characters</i> can lie even if the LLM’s task is just “generate text.”<p>This is fiction, like playing a role-playing game. But we are routinely talking to LLM-generated ghosts, and the “helpful, harmless” AI assistant is not the only ghost an LLM can conjure up.<p>It’s hard to see how role-playing could be all that harmful for a rational adult, but there are news reports that for some people, it definitely is.
Because you want both likeability and factuality, and if you try to mash them into a single reward they both suffer. The idea is that by keeping them separate you reduce concealment pressure: you incentivize accurate <i>self-reporting</i>, rather than <i>appearing</i> correct.
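A toy sketch of that distinction, with made-up names and numbers (not the actual training setup from the article):<p><pre><code># Hypothetical sketch: one blended score vs. two separate reward signals.

def blended_reward(factuality: float, likeability: float) -> float:
    # Mashed together: the model can trade factuality for likeability
    # and still score well, which creates pressure to appear correct.
    return 0.5 * factuality + 0.5 * likeability

def split_rewards(factuality: float, self_report_accuracy: float):
    # Kept separate: the second signal rewards only accurate self-reporting
    # about the answer, not making the answer look good.
    return factuality, self_report_accuracy
</code></pre>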
The training set likely contains things like rationalizations or euphemisms in contexts that are not harmful. I think those are inevitable.<p>Eventually, and especially in reasoning models, these behaviors will generalize outside their original context.<p>The "honesty" training seems to be an attempt to introduce those confession-like texts into the training data. You then get a chance of the model engaging in confessing; it won't do it if it has never seen it.<p>It's not really lying, and it's not really confessing, and so on.<p>If you always reward pure honesty, the model might eventually tell you that it wouldn't love you if you were a worm, or something like that. Brutal honesty can be a side effect.<p>What you actually want is to be able to easily control which behavior the model engages in, because sometimes you will want it to lie.<p>Also, lies are completely different from hallucinations. Those (IMHO) are when the model displays behavior that is non-human and jarring. Side effects. Probably inevitable too.
They really lie.<p>Not on purpose, but because they are trained with rewards that favor lying as a strategy.<p>Othello-GPT is a good example for understanding this. Without being explicitly trained to do so, just from the task of 'predicting moves on an Othello board', Othello-GPT spontaneously developed the strategy of 'simulate the entire board internally'. Lying is a similarly emergent, very effective strategy for earning reward.
> They really lie. Not on purpose<p>You can't lie by accident. You can tell a falsehood, however.<p>But where LLMs are concerned, they don't tell truths or falsehoods either, as "telling" also requires intent. Moreover, LLMs don't actually contain propositional content.
Reference: <a href="https://www.science.org/content/article/ai-hallucinates-because-it-s-trained-fake-answers-it-doesn-t-know" rel="nofollow">https://www.science.org/content/article/ai-hallucinates-beca...</a><p>If you don't know the answer, and are only rewarded for correct answers, guessing, rather than saying "I don't know", is the optimal approach.
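To put toy numbers on that incentive (mine, not from the article): under a grader that gives one point for a correct answer and zero for everything else, guessing always has at least as much expected value as abstaining.<p><pre><code># Hypothetical grader: 1 point for a correct answer, 0 otherwise,
# with no partial credit for admitting uncertainty.
p_correct_if_guessing = 0.25  # say, four plausible answers and no real knowledge

expected_score_guess = p_correct_if_guessing * 1.0  # 0.25
expected_score_idk = 0.0                            # "I don't know" earns nothing

print(expected_score_guess > expected_score_idk)    # True: guessing wins
</code></pre>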
It's more than just that, but thanks for that link, I've been meaning to dig it up and revisit it. Beyond hallucinations, there are also deceptive behaviors like hiding uncertainty, omitting caveats or doubling down on previous statements even when weaknesses are pointed out to it. Plus there necessarily will be lies in the training data as well, sometimes enough of them to skew the pretrained/unaligned model itself.
Not sure if that counts as lying, but I've heard that an ML model (way before all this GPT/LLM stuff) learned to classify images based on the text written in them. For an obfuscated example, it learned to read "stop", "arrêt", "alto", etc. on a stop sign instead of recognizing the red octagon with white letters. Which naturally does not work when the actual dataset has different text.
Typographic attacks against vision-language models are still a thing with more recent models like GPT-4V: <a href="https://arxiv.org/abs/2402.00626" rel="nofollow">https://arxiv.org/abs/2402.00626</a>
That does feel a little more like over-fitting, but you might be able to argue that there's some philosophical proximity to lying.<p>I think, largely, the<p><pre><code> Pre-training -> Post-training -> Safety/Alignment training
</code></pre>
pipeline would obviously produce 'lying'. The trainings are in a sort of mutual dissonance.
They mostly imitate patterns in the training material, and they do whatever got the reward up during RL training. There are probably lots of examples of both lying and confessions in the training data. So it should surprise nobody that next-sentence machines fill in a lie or a confession in situations similar to the training data.<p>I don't consider that very intelligent or more emergent than other behaviors. Now, if nothing like that was in the training data (pure honesty with no confessions), it would be very interesting if it replied with lies and confessions, because it wasn't pretrained to lie or confess like the above model likely was.
These models don't even choose one outcome. They assign probabilities to ALL possible next tokens, and the backend program decides whether to pick the most probable one OR a different one.<p>But in practical usage, if an LLM does not rank token probabilities correctly, it will feel the same as it "lying".<p>They are supposed to do whatever we want them to do. They WILL do whatever the deterministic nature of their final model output forces them to do.
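A rough sketch of that split, where the model only supplies probabilities and the decoding code actually picks a token (made-up numbers, simplified sampler):<p><pre><code>import random

def pick_next_token(token_probs: dict, temperature: float = 1.0) -> str:
    # The model only supplies token_probs; this code makes the choice.
    if temperature == 0:
        # Greedy decoding: always take the single most probable token.
        return max(token_probs, key=token_probs.get)
    # Otherwise sample, so a less probable token can still be chosen.
    tokens = list(token_probs)
    weights = [p ** (1.0 / temperature) for p in token_probs.values()]
    return random.choices(tokens, weights=weights)[0]

# Made-up distribution over the next token:
probs = {"Paris": 0.7, "Lyon": 0.2, "Berlin": 0.1}
print(pick_next_token(probs, temperature=0))    # always "Paris"
print(pick_next_token(probs, temperature=1.0))  # usually "Paris", but not always
</code></pre>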
Lying requires intent by definition. LLMs do not and cannot have intent, so they are incapable of lying. They just produce text. They are software.
AFAICT, there are several sources of untruth in a model's output: unintentional mistakes in the training data, intentional ones (i.e. lies/misinformation in the training data), hallucinations/confabulations filling in for missing data in the corpus, and lastly, deceptive behavior instilled as a side effect of alignment/RL training. There is intent of varying strength behind all of these, originating from the people and organizations behind the model. They want to create a successful model and are prepared to accept certain trade-offs in order to get there. Hopefully the positives outweigh the negatives, but it's hard to tell sometimes.