Humans don't understand their thought process either.<p>In general, neural nets have no insight into what they are doing, because they can't. Can you tell me which neurons fired in the process of reading this text? No. You don't have access to that information. We can recursively model our own network and say something about which regions of the brain are probably involved based on other knowledge, but that's all a higher-level model. We have no access to our own inner workings, because that turns into an infinite regress problem of understanding our understanding of our understanding of ourselves that can't be solved.<p>The terminology of this next statement is a bit sloppy since this isn't a mathematics or computer science dissertation but rather a comment on HN, but: a finite system cannot understand itself. You can put some decent mathematical meat on those bones if you try, and there may be some degenerate cases where you can construct a system that understands itself for some definition of "understand", but in the absence of such deliberation, and when building systems for "normal tasks", you can count on the system not being able to understand itself fully by any reasonably normal definition of "understand".<p>I've tried to find the link for this before, but I know it was on HN: someone asked an LLM to do some simple arithmetic, like adding some numbers, and asked the LLM to explain how it was doing it. They also dug into the neural net activations themselves and traced which neurons were doing what. While the LLM gave a perfectly correct explanation of how to do elementary school arithmetic, what the neural net <i>actually</i> did was something else entirely, rooted in how the neurons actually work: it basically "felt" its way to the correct answer, having been trained on so many instances already. In much the same way as any human with modest experience in adding two-digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.<p>In the spirit of science ultimately being about "these preconditions have this outcome" rather than necessarily about "why": if having a model narrate to itself about how to do a task or "confess" improves performance, then performance is improved and that is simply a brute fact, but that doesn't mean the naive human understanding of why such a thing might work is correct.
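To make the "dug into the activations" part concrete, here's a minimal sketch (not the experiment described above, which I still can't find) of the two kinds of evidence being contrasted: what the model says about its method versus the raw activations that actually produced its answer. The model name, layer index, and prompts are placeholders, assuming PyTorch and Hugging Face transformers; a toy model like gpt2 won't add reliably, the point is only where the two kinds of evidence live:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any small causal LM works for the sketch
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    captured = {}

    def hook(module, inputs, output):
        # Stash this block's output activations; this is the level the
        # interpretability work actually looks at.
        captured["acts"] = output[0].detach()

    handle = model.transformer.h[6].register_forward_hook(hook)  # one arbitrary layer

    def ask(prompt, n=12):
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=n, do_sample=False)
        return tok.decode(out[0][ids["input_ids"].shape[1]:])

    answer = ask("Q: What is 36 + 59? A:")                  # the behaviour
    story = ask("Q: Explain how you added 36 and 59. A:")   # the self-report
    handle.remove()

    print(answer)
    print(story)                   # typically a narrative about digits and carrying
    print(captured["acts"].shape)  # activations from the most recent forward pass

The self-report and the activations are produced by the same forward passes, but nothing forces the former to describe the latter, which is the whole point.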
> In much the same way as any human with modest experience in adding two-digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.<p>Right, but that makes LLMs strictly worse than humans at reporting how they solve these sorts of problems. Humans can tell you whether they did the elementary school addition algorithm or not. It seems like Claude actually doesn't know, in the same way humans don't really know how they balance on two legs: it's just too baked into the structure of their cognition to introspect. But stuff like "adding two-digit numbers" is usually straightforwardly introspectable for humans, even if it's just "oh, I just vibed it" vs "I mentally added the digits and carried the one": humans can mostly report which it was.<p>Here's Anthropic's research:<p><a href="https://www.anthropic.com/research/tracing-thoughts-language-model" rel="nofollow">https://www.anthropic.com/research/tracing-thoughts-language...</a>
Makes me wonder if one could train a "neural net surgeon" model that can trace activations in another live model and manipulate it according to plain-language instructions.
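The "manipulate activations" half of that already exists in rough form as activation steering; the plain-language surgeon on top is the speculative part. A hand-rolled sketch, assuming PyTorch and Hugging Face transformers, with gpt2, the layer index, the contrast prompts, and the scaling factor all arbitrary choices:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    LAYER = 6                                          # arbitrary block to operate on

    def mean_hidden(prompt):
        # Mean hidden state of the chosen layer over the prompt's tokens.
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
        return hs.mean(dim=1)

    # Contrast two prompts to get a crude direction in activation space.
    steer = mean_hidden("I am feeling very happy.") - mean_hidden("I am feeling very sad.")

    def add_steer(module, inputs, output):
        # Nudge the block's output along the steering direction on every forward pass.
        return (output[0] + 4.0 * steer,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
    ids = tok("Today the weather is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0]))
    handle.remove()

A "surgeon" model would be the thing that picks the layer, the direction, and the scale from an instruction, instead of having them hard-coded like here.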