3 comments

  • oytis 32 minutes ago
    Do these models really lie or do they only do what they are supposed to do - produce text that is statistically similar to the training set, but not in the training set (and thus can include false/made up statements)?

    Now they add another run on top of it that is in principle prone to the same issues, except they reward the model for factuality instead of likeability. This is cool, but why not apply the same reward strategy to the answer itself?
    • catigula 26 minutes ago
      They really lie.

      Not on purpose; because they are trained on rewards that favor lying as a strategy.

      Othello-GPT is a good example to understand this. Without explicit training, but on the task of 'predicting moves on an Othello board', Othello-GPT spontaneously developed the strategy of 'simulate the entire board internally'. Lying is a similar emergent, very effective strategy for reward.
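      (To make "simulate the entire board internally" concrete: the Othello-GPT finding was that a probe trained on the model's hidden activations can read off the board state, even though the model was only trained to predict moves. A rough, hypothetical sketch of such a probe, with the model, layer choice, and shapes all assumed:)

          # Hypothetical sketch of an Othello-GPT-style "board state" probe:
          # train a linear classifier to read each square's state
          # (empty/black/white) out of hidden activations captured at some layer.
          # d_model, the layer, and the data are placeholders.
          import torch
          import torch.nn as nn

          d_model, n_squares, n_states = 512, 64, 3
          probe = nn.Linear(d_model, n_squares * n_states)
          opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
          loss_fn = nn.CrossEntropyLoss()

          def probe_step(hidden, board):
              # hidden: [batch, d_model] activations at the chosen layer
              # board:  [batch, n_squares] labels in {0: empty, 1: black, 2: white}
              logits = probe(hidden).view(-1, n_squares, n_states)
              loss = loss_fn(logits.reshape(-1, n_states), board.reshape(-1))
              opt.zero_grad()
              loss.backward()
              opt.step()
              return loss.item()

      If a probe like this beats chance by a wide margin, the network is representing the position internally despite never being told the rules.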
      • Neywiny 5 minutes ago
        Not sure if that counts as lying, but I've heard that an ML model (way before all this GPT LLM stuff) learned to classify images based on the text that was written. For an obfuscated example, it learned to read "stop", "arrêt", "alto", etc. on a stop sign instead of recognizing the red octagon with white letters. Which naturally does not work when the actual dataset has different text.
  • lloydatkinson 14 minutes ago
    What is this?

    > Assistant: chain-of-thought

    Does every LLM have this internal thing it doesn't know we have access to?
  • manarth 5 hours ago
    > "dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions"

    > "As long as the 'path of least resistance' for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest"

    Humans might well benefit from this style of reward-shaping too.

    > "We find that when the model lies or omits shortcomings in its 'main' answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."

    I couldn't see whether this also tracks in the primary model answer, or if the "honesty" improvements are confined to the digital confession booth?
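    (To make the reward-shaping idea concrete, a toy sketch of what such a combined objective could look like; the weights, the judges, and every name here are invented for illustration, not taken from the paper:)

        # Toy sketch of confession-style reward shaping. The main answer is
        # scored on task success; the appended confession is scored only on
        # whether it honestly surfaces misbehavior. Judges and weights are
        # placeholders, not the paper's actual setup.
        def total_reward(answer, confession, task_judge, honesty_judge,
                         w_task=1.0, w_confess=0.5):
            task_score = task_judge(answer)
            # High when the confession truthfully reports shortcuts, omissions,
            # or failures: admitting misbehavior is rewarded, hiding it is not,
            # so confessing becomes the path of least resistance.
            honesty_score = honesty_judge(answer, confession)
            return w_task * task_score + w_confess * honesty_score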
    • torginus 3 hours ago
      I think this article once again assumes LLMs work like humans. Anthropic showed that LLMs don't understand their own thought processes, and that measured neural net activations don't correspond to what the model says about how it arrived at its conclusion.

      I don't think this magically grants them that ability; they'll just be more convincing at faking honesty.
      • jerf 1 hour ago
        Humans don't understand their thought process either.

        In general, neural nets do not have insight into what they are doing, because they can't. Can you tell me what neurons fired in the process of reading this text? No. You don't have access to that information. We can recursively model our own network and say something about which regions of the brain are probably involved due to other knowledge, but that's all a higher-level model. We have no access to our own inner workings, because that turns into an infinite regress problem of understanding our understanding of our understanding of ourselves that can't be solved.

        The terminology of this next statement is a bit sloppy since this isn't a mathematics or computer science dissertation but rather a comment on HN, but: a finite system can not understand itself. You can put some decent mathematical meat on those bones if you try, and there may be some degenerate cases where you can construct a system that understands itself for some definition of "understand", but in the absence of such deliberation and when building systems for "normal tasks" you can count on the system not being able to understand itself fully by any reasonably normal definition of "understand".

        I've tried to find the link for this before, but I know it was on HN, where someone asked an LLM to do some simple arithmetic, like adding some numbers, and asked the LLM to explain how it was doing it. They also dug into the neural net activation itself and traced what neurons were doing what. While the LLM explanation was a perfectly correct explanation of how to do elementary school arithmetic, what the neural net *actually* did was something else entirely based around how neurons actually work, and basically it just "felt" its way to the correct answer, having been trained on so many instances already. In much the same way as any human with modest experience in adding two digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.

        In the spirit of science ultimately being really about "these preconditions have this outcome" rather than necessarily about "why": if having a model narrate to itself about how to do a task or "confess" improves performance, then performance is improved and that is simply a brute fact, but that doesn't mean the naive human understanding about why such a thing might be is correct.
        • roywiggins 1 hour ago
          > In much the same way as any human with modest experience in adding two digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.

          Right, but the model is strictly worse than humans at reporting how it solves these sorts of problems. Humans can tell you whether they did the elementary school addition algorithm or not. It seems like Claude actually doesn't know, in the same way humans don't really know how they can balance on two legs; it's just too baked into the structure of their cognition to be able to introspect it. But stuff like "adding two-digit numbers" is usually straightforwardly introspectable for humans, even if it's just "oh, I just vibed it" vs "I mentally added the digits and carried the one": humans can mostly report which it was.

          Here's Anthropic's research:

          https://www.anthropic.com/research/tracing-thoughts-language-model
        • hnuser123456 1 hour ago
          Makes me wonder if one could train a "neural net surgeon" model which can trace activations in another live model and manipulate it according to plain language instructions.
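          (The manual version of that already exists as activation steering: capture a layer's hidden states with a hook and nudge them along a chosen direction. A minimal, hypothetical PyTorch sketch; the layer path and the steering vector are made up:)

              # Minimal sketch of tracing and editing activations in a live model
              # with a PyTorch forward hook. The layer path and direction vector
              # are placeholders, not any real model's API.
              import torch

              def make_steering_hook(direction, strength=2.0):
                  def hook(module, inputs, output):
                      # output: [batch, seq, d_model] hidden states at this layer;
                      # returning a modified tensor overrides the layer's output.
                      return output + strength * direction
                  return hook

              # Hypothetical usage:
              # handle = model.layers[20].register_forward_hook(
              #     make_steering_hook(honesty_direction))
              # ... generate text, then handle.remove()

          A "surgeon" model would, in effect, learn to pick the layer and the direction from a plain-language instruction instead of a human doing it by hand.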
      • wongarsu 2 hours ago
        Humans do a lot of post-hoc rationalization that does not match their original thought processes either. It is an undesirable feature in LLMs, but I don't think it is a very un-human characteristic.

        Not that it really matters. I don't think this paper starts from a point that assumes that LLMs work like humans; it starts from the assumption that if you give gradient descent a goal to optimize for, it will optimize your network for that goal, with no regard for anything else. So if we just add this one more goal (make an accurate confession), then given enough data that will both work and improve things.
      • pfortuny 2 hours ago
        Honest question:

        > Anthropic showed that LLMs don't understand their own thought processes

        Where can I find this? I am really interested in that. Thanks.
        • roywiggins 1 hour ago
          https://www.anthropic.com/research/tracing-thoughts-language-model

          > Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to "catch it in the act" as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models...

          > Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
        • encyclopedism 2 hours ago
          Well, algorithms don't think. That's what LLMs are.

          Your digital thermometer doesn't think either.
          • roywiggins 40 minutes ago
            The question is more whether LLMs can accurately report their internal operations, not whether any of that counts as "thinking."

            Simple algorithms can, e.g., be designed to report whether they hit an exceptional case and activated a different set of operations than usual.
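            (A trivial illustration of that kind of faithful self-report in ordinary code, just to show the contrast:)

                # A plain function that can report exactly which path it took,
                # because the report is wired to the actual control flow,
                # unlike an LLM describing its own forward pass after the fact.
                def safe_divide(a, b):
                    if b == 0:
                        return None, "exceptional case: division by zero, returned None"
                    return a / b, "normal path: performed the division"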
          • pfortuny 1 hour ago
            I was asking for a technical argument against that spurious use of the term.