Going meta a bit: comments so far on this post show diametrically opposing understandings of the paper, which demonstrates just how varied the interpretation of complex text can be.<p>We hold AI to a pretty high standard of correctness, as we should, but humans are not that reliable on matters of fact, let alone on rigor of reasoning.
This is extremely common in these discussions. Most humans are not that good at reasoning themselves and fall for the same kind of fallacies over and over because of the way they were brought up (their training data, so to speak). And yet they somehow think they can argue about why LLMs should or shouldn't be able to do the same. If anything, the current limits of these models show the limits of human cognition which is spread throughout the internet - because this is literally what they learned from. I believe once we achieve more independent learning (like we've seen glimpses of in the MuZero paper) these models will blow human intelligence out of the water.
> Most humans are not that good at reasoning themselves and fall for the same kind of fallacies over and over because of the way they were brought up<p>Disagree that it's easy to pin this on "how they were brought up". It seems quite likely that we will learn the flaws are part of what makes our intelligence "work" and stay adaptive to changing environments. It may be favourable in terms of cultural evolution for parents to indoctrinate flawed logic, not unlike how replication errors are part of how evolution can and must work.<p>In other words: I'm not sure these "failures" of the models are actual failures (in the sense of being non-adaptive rather than important to the evolutionary processes of intelligence), and further, it is perhaps <i>us humans</i> who are "failing" by <i>over</i>-indexing on "reason" as the explanation for how we arrived here and continue to persist in time ;)
> Disagree that it's easy to pin on "how they were brought up".<p>Indeed. That might play a role, but another less politically charged aspect to look at is just: how much effort is the human currently putting in?<p>Humans are often on autopilot, perhaps even most of the time. Autopilot means taking lazy intellectual shortcuts. And to echo your argument: in familiar environments those shortcuts are often a good idea!<p>If you just do whatever worked last time you were in a similar situation, or whatever your peers are doing, chances are you'll have an easier time than reasoning everything out from scratch. Especially in any situations involving other humans cooperating with you, predictability itself is an asset.
> And to echo your argument: in familiar environments those shortcuts are often a good idea!<p>Only to continue to reaffirm the original post, this was some of the basis for my dissertation. Lower-level practice, or exposure to tons of interactive worked examples, allowed students to train the "mental muscle memory" for coding syntax while learning the more general CS concept (like loops instead of for(int i = 0...)). The shortcut in this case is learning what the syntax for a loop looks like so that it can BECOME a shortcut. Once it's automatic, it can be compartmentalized as "loop" instead of the student getting anxious over where the semicolons go.
> Humans are often on autopilot, perhaps even most of the time<p>I wonder what responsiveness/results someone would get running an LLM with just ~20 watts for processing and memory, especially if it was getting trained at the same time.<p>That said, we do have a hardware advantage, what with the enormous swarm of nano-bots using technology and techniques literally beyond our best science. :p
It's because we can put responsibility on humans to be correct but we can't on computers. Humans given the appropriate incentives are <i>very</i> good at their jobs and there is a path for compensation if they screw up. Computers have neither of these things.
I think this is a key argument in how powerful AI can become. We may be able to create incredibly intelligent systems, but at the end of the day you can’t send a computer to jail. That inherently limits the power that will be given over to AI. If an AI accidentally kills a person, the worst that could be done to it is that it is turned off, whereas the owners of the AI would be held liable.
Humans already put a lot of trust in computers not because they can take responsibility but because traditional software can be made very predictable or at least compliant. There are whole industries built around software standards to ensure that. The problem is we don't yet know enough about identifying and patching problems in these models. Once we get something equivalent to MISRA for LLMs to achieve the same level of compliance, there is very little that could still hold them back.
> Most humans are not that good at reasoning themselves [...]<p>I'd say most humans most of the time. Individual humans can do a lot better (or worse) depending on how much effort they put in, and whether they slept well, had their morning coffee, etc.<p>> If anything, the current limits of these models show the limits of human cognition which is spread throughout the internet - because this is literally what they learned from.<p>I wouldn't go quite so far. Especially because some tasks require smarts, even though there's no smarts in the training data.<p>The classic example is perhaps programming: the Python interpreter is not intelligent by any stretch of the imagination, but an LLM (or a human) needs smarts to predict what it's going to do, especially if you are trying to get it to do something specific.<p>That example might skirt too close to the MuZero paper that you already mentioned as an exception / extension.<p>So let's go with a purer example: even the least smart human is a complicated system with a lot of hidden state, and parts of that state shine through when that human produces text. Predicting the next token of text just from the previous text is a lot harder and requires a lot more smarts than if you had access to the internal state directly.<p>It's sort of like an 'inverse problem'. <a href="https://en.wikipedia.org/wiki/Inverse_problem" rel="nofollow">https://en.wikipedia.org/wiki/Inverse_problem</a>
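To make the interpreter example concrete, here is a toy snippet of my own (purely illustrative, not from the paper or the parent comment): predicting the single number it prints requires simulating the loop's hidden state, which is the 'inverse problem' flavour of the task.

```python
# Toy illustration: the interpreter isn't "smart", but predicting its output
# means tracking state that the source text never spells out explicitly.
def step(x):
    return 3 * x + 1 if x % 2 else x // 2

x, steps = 7, 0
while x != 1:
    x = step(x)
    steps += 1

print(steps)  # a next-token predictor has to simulate the whole loop to land on 16
```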
Human society, when the individual reaches the limits of their "reasoning", usually grows structures to circumvent those limits and to produce and use artifacts that lie beyond them. An illiterate person can still use Netflix, etc.<p>The ability to circumvent these limitations is encoded in company procedures and in the architecture of hierarchies and committees within companies and states. Could AI be "upgraded" beyond human reasoning by referencing these "meta-organisms" and their reasoning processes, which can produce things that are larger than the sum of their parts?<p>Could AI become smarter by rewarding this meta-reasoning and prompting for it?<p>"ChatGPT, for your next task you are going to model a company's reasoning process internally to produce a better outcome."<p>This should also make it possible to circumvent human reasoning bugs, like tribal thinking (which is the reason we have black-and-white thinking: you have to agree with the tribe's group-think, or there will be civil war risking all members of the tribe. Which is why there can always be only ONE answer, one idea, one plan, one leader - and multiple simultaneous explorations at once, as in capitalism, cause deep unease).
> We hold AI to a pretty high standard of correctness, as we should, but humans are not that reliable on matters of fact, let alone on rigor of reasoning.<p>I never understood this line of reasoning.<p>1. Humans can't run faster than 30 mph.<p>2. Therefore we can't complain if cars/trains/transport always go slower than 30 mph.<p>These comparisons also hide that we are comparing the best of AI (massive LLMs) with median/average human reasoning.
The whole social dynamic of this conversation is amazing. How quickly complacency set in. In 2010, if you told me I could get a model to respond approximately as intelligently as a low-intelligence human, I would be amazed. As a matter of perspective, I am still amazed.<p>At the same time I see such negative sentiment around the capabilities at their current limits.<p>We are reaching an era of commodified intelligence, which will be disruptive and surprising. Even the current limited models change the economics dramatically.
> Even the current limited models change the economics dramatically.<p>Yes, though at the moment the hype is still a lot bigger than the impact.<p>But I am fairly confident that even without any new technical ideas for the networks themselves, we will see a lot more economic impact over the next few years, as people work out how to use these new tools.<p>(Of course, the networks will also keep evolving.)
The main issue is the expectations. The companies behind these models marketed them as being intelligent, or close to it, so it's natural for people to expect that and to react as they do when those expectations are not met.
It seems obvious to me that LLMs wouldn't be able to find examples of every single problem posed to them in training data. There wouldn't be enough examples for the factual lookup needed in an information-retrieval-style search. I can believe that they're doing some form of extrapolation to create novel solutions to posed problems.<p>It's interesting that this paper doesn't contradict the conclusions of the Apple LLM paper[0], where prompts were corrupted to force the LLM into making errors. I can also believe that LLMs can only make small deviations from existing example solutions in creating these novel solutions.<p>I hate that we're using the term "reasoning" for this solution-generation process. It's a term coined by LLM companies to evoke an almost emotional response in how we talk about this technology. However, it does appear that we are capable of instructing machines to follow a series of steps using natural language, with some degree of ambiguity. That in and of itself is a huge stride forward.<p>[0] <a href="https://machinelearning.apple.com/research/gsm-symbolic" rel="nofollow">https://machinelearning.apple.com/research/gsm-symbolic</a>
I very much agree with the perspective that LLMs are not suited for “reasoning” in the sense of creative problem solving or application of logic. I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.<p>That being said, I do think that LLMs are capable of “verbal reasoning” operations. I don’t have a good sense of the boundaries that distinguish the logics - verbal, qualitative, quantitative reasoning. What comes to my mind is the verbal sections of standardized tests.
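A minimal sketch of that "compiler layer" idea, assuming some chat-completion client behind a placeholder `ask_llm` function (the function, prompts, and flow are mine, not any particular vendor's API):

```python
import subprocess
import sys
import tempfile

def ask_llm(prompt: str) -> str:
    # Placeholder: wire this up to whatever LLM client you actually use.
    raise NotImplementedError

def solve(question: str) -> str:
    # 1. Let the model translate imprecise natural language into a formal artifact.
    code = ask_llm(f"Write a Python script that prints the answer to: {question}")
    # 2. Execute the formal artifact; this is where the precise work happens.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    result = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
    # 3. Let the model synthesise the raw output back into prose.
    return ask_llm(
        f"Question: {question}\nProgram output: {result.stdout}\nAnswer in one sentence."
    )
```

The fragile part in practice is step 2: executing model-generated code needs sandboxing and error handling, which this sketch deliberately ignores.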
> I think that the real potential in this domain is having them act as a sort of “compiler” layer that bridges the gap between natural language - which is imprecise - and formal languages (sql, prolog, python, lean, etc) that are more suited for solving these types of problems. And then maybe synthesizing the results / outputs of the formal language layer. Basically “agents”.<p>Well, if you do all that, would you say that the system as a whole has 'reasoned'? (I think ChatGPT can already call out to Python.)
<i>I can believe that they're doing some form of extrapolation to create novel solutions to posed problems</i><p>You can believe it, but what sort of evidence are you using for this belief?<p>Edit: Also, the abstract of the Apple paper hardly says "corruption" (implying something tricky); it says that they changed the initial numerical values.
Totally. These companies are pushing to showcase their AI models as self-thinking, reasoning AI, while they are just trained on huge amounts of data, which they extrapolate from to find the right answer.<p>They still can't think outside the box of their datasets.
You mean you need humans to solve a problem step by step so a neural net can mimic it? It sounds kinda obvious now that I write it out.
This would explain the unexpected benefits of training on code.
That sounds interesting, but I'm a layman and don't know anything about it. Can you provide a link?<p>I was able to find <a href="https://arxiv.org/abs/2408.10914" rel="nofollow">https://arxiv.org/abs/2408.10914</a>, but I don't have the context to know whether it's the paper you're talking about.
I think GP was probably referring to "Scaling Data-Constrained Language Models" (2305.16264) from NeurIPS 2023, which looked first at how to optimally scale LLMs when training data is limited. There is a short section on mixing code (Python) into the training data and the effect this has on performance on e.g. natural language tasks. One of their findings was that training data can be up to 50% code without actually degrading performance, and in some cases (benchmarks like bAbI and WebNLG) with improvements (probably because these tasks have an emphasis on what they call "long-range state tracking capabilities").<p>For reference: In the Llama 3 technical report (2407.21783), they mention that they ended up using 17% code tokens in their training data.
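For intuition only, here is a toy sketch of what sampling at a fixed code fraction might look like (the 0.17 default just echoes the Llama 3 figure above; the sampling scheme itself is made up, not how either paper actually mixes data):

```python
import random

def mix_stream(text_docs, code_docs, code_frac=0.17, seed=0):
    """Interleave code and text documents at roughly the target code fraction."""
    rng = random.Random(seed)
    text_it, code_it = iter(text_docs), iter(code_docs)
    while True:
        source = code_it if rng.random() < code_frac else text_it
        try:
            yield next(source)
        except StopIteration:  # stop when either pool runs dry
            return
```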
There was an interview with Zuckerberg about how they initially split the training - the Llama chat models on purely normal text and CodeLlama on code - but later realized that if they combined the training sets they got a model that was better at both tasks than each specialized one was.
> In the extreme case, a language model answering reasoning questions may rely heavily on retrieval from parametric knowledge influenced by a limited set of documents within its pretraining data. In this scenario, specific documents containing the information to be retrieved (i.e. the reasoning traces) contribute significantly to the model’s output, while many other documents play a minimal role.<p>> Conversely, at the other end of the spectrum, the model may draw from a broad range of documents that are more abstractly related to the question, with each document influencing many different questions similarly, but contributing a relatively small amount to the final output. We propose generalisable reasoning should look like the latter strategy.<p>Isn't it much more impressive if a model can generalize from a single example?
>On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies<p>Surprised this gets voted up given the number of users on HN who think LLMs can't reason at all and that the only way to characterize an LLM is through the lens of a next-token predictor. Last time I was talking about LLM intelligence, someone rudely told me to read up on how LLMs work, and that we already know exactly how they work and that they're just token predictors.
The “surprising gaps” are precisely because they’re not reasoning—or, at least, not “reasoning” about the things a human would be to solve the problems, but about some often-correlated but different set of facts about relationships between tokens in writing.<p>It’s the failure modes that make the distinction clearest.<p>LLM output is only meaningful, in the way we usually mean that, at the point we assigned external, human meaning to it, after the fact. The LLM wouldn’t stop operating or become “confused” if fed gibberish, because the meaning it’s extracting doesn’t depend on the meaning <i>humans</i> assign things, except by coincidence—which coincidence we foster by feeding them with things we do <i>not</i> regard as gibberish, but that’s beside the point so far as how they “really work” goes.
The loudest people seem to be those with the most extreme positions, and that includes on "is ${specific AI} (useless|superhuman) for ${domain}?". Perhaps it's just perception, but perhaps the arguments make them persist, as CGP Grey pointed out: <a href="https://www.youtube.com/watch?v=rE3j_RHkqJc" rel="nofollow">https://www.youtube.com/watch?v=rE3j_RHkqJc</a><p>As I'm in the middle, I get flack from people on both extremes, as I'm outside their (equivalent of or just literally?) Overton window on this subject. Seems like an odd zone to be in for the opinion "this is a useful tool, but I see loads of ways it can go wrong". Makes me wonder what the <i>real</i> common discourse was of looms during the industrial revolution, and not just the modern summary of that era.
> Makes me wonder what the <i>real</i> common discourse was of looms during the industrial revolution, and not just the modern summary of that era.<p>Interesting question. I did a little searching with help from Claude, ChatGPT, Google, and the Internet Archive. Here are some links to writings from that time:<p>“Thoughts on the use of machines, in the cotton manufacture” (1780)<p><a href="https://archive.org/details/bim_eighteenth-century_thoughts-on-the-use-of-m_1780" rel="nofollow">https://archive.org/details/bim_eighteenth-century_thoughts-...</a><p>Excerpt: “How many writers and copiers of books were thrown out of employment, or obliged to change it, by the introduction of printing presses? About ten years ago, when the Spinning Jennies came up, old persons, children, and those who could not easily learn to use the new machines, did suffer, for a while; till families had learned to play into one another's hands, by each taking a different kind of work. But the general benefit, which was received from the machines, very soon silenced all objections. And every sensible man now looks upon them with gratitude and approbation. It is probable, this will be the case in all new inventions.”<p>Kevin Binfield, ed., <i>Writings of the Luddites</i> (1811-1815; 2004)<p><a href="https://ia903409.us.archive.org/16/items/writings-of-the-luddites/Writings%20of%20the%20Luddites.pdf" rel="nofollow">https://ia903409.us.archive.org/16/items/writings-of-the-lud...</a><p>Robert Owen, <i>Observations on the effect of the manufacturing system</i> (1817)<p><a href="https://archive.org/details/observationsonef00owenrich/page/n5/mode/2up" rel="nofollow">https://archive.org/details/observationsonef00owenrich/page/...</a><p>William Radcliffe, <i>Origin of the new system of manufacture commonly called power-loom weaving</i> (1828)<p><a href="https://archive.org/details/originofnewsyste0000radc" rel="nofollow">https://archive.org/details/originofnewsyste0000radc</a><p>“An address to the Glasgow cotton-spinners on the moral bearing of their association” (1838)<p><a href="https://catalogue.nla.gov.au/catalog/6023196" rel="nofollow">https://catalogue.nla.gov.au/catalog/6023196</a>
The reality is that both can be true at the same time. Yes they're next token predictors, but sometimes the only way to do that correctly is by actually understanding everything that came before and reasoning logically about it. There's some Sutskever quote that if the input to a model is most of a crime novel, and the next token is the name of the perpetrator, then the model understood the novel.<p>Transformers are arbitrary function approximators, so there's no hard limitation on what they can or cannot do.
This is highly relevant to the recent discussion at <a href="https://news.ycombinator.com/item?id=42285128">https://news.ycombinator.com/item?id=42285128</a>.<p>Google claims that their use of pretraining is a key requirement for being able to deliver a (slightly) better chip design. And they claim that a responding paper that did not attempt to do pretraining, should have been expected to be well below the state of the art in chip design.<p>Given how important reasoning is for chip design, and given how important pretraining is for driving reasoning in large language models, it is obvious that Google's reasoning is very reasonable. If Google barely beats the state of the art while using pretraining, an attempt that doesn't pretrain should be expected to be well below the current state of the art. And therefore that second attempt's poor performance says nothing about whether Google's results are plausible.
I am not an expert in the particular application domain of that article, but I can see why their argument about pretraining might be valid. It is not especially controversial to say that pretraining neural nets improves few-shot learning performance. And I suspect there is an inflection point for every problem beyond which pretrained neural nets yield better few-shot learning performance than less data-hungry approaches, such as hand-crafted features or strong priors.<p>That being said, the question here seems to be whether that inflection point has been reached in this case.
That resonates - fewer facts and more reasoning in the training data. The lowest-hanging fruit in terms of non-synthetic data is probably mathematical proofs. With Prolog and the like, many alternate reasoning paths could be generated. It's hard to say whether these many paths would help in LLM training without access to the gigantic machines (it's so unfair) to try it on.
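A toy sketch of what "many alternate reasoning paths" to the same result could look like, using plain Python rather than Prolog (entirely illustrative, not from the comment or any paper):

```python
from itertools import permutations

def reasoning_traces(values):
    """Yield step-by-step traces that sum the same numbers in different orders."""
    for order in permutations(values):
        steps, total = [], 0
        for v in order:
            steps.append(f"{total} + {v} = {total + v}")
            total += v
        yield steps

for trace in reasoning_traces([2, 5, 7]):
    print("; ".join(trace))  # six distinct traces, all ending in 14
```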
Does this mean LLMs might do better if trained on large amounts of student notes, exams, book reviews and such? That would be incredibly interesting.
Is this conclusion similar to my layman's understanding of AlphaGo vs AlphaZero? That human procedural knowledge helps ML training to a point, and from there on becomes a limitation?
No. They're saying that the model they analyzed used mainly information on _how_ to solve math problems from its training data, rather than documents that contained the answers to the (identical) math problems:<p>> "We investigate which data influence the model’s produced reasoning traces and how those data relate to the specific problems being addressed. Are models simply ‘retrieving’ answers from previously seen pretraining data and reassembling them, or are they employing a more robust strategy for generalisation?"<p>> "When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning."<p>Example reasoning question:
> "Prompt
Calculate the answer: (7 - 4) * 7
Think step-by-step."
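For reference, the step-by-step trace that prompt asks the model to produce boils down to:

```python
# (7 - 4) * 7 = 3 * 7 = 21
print((7 - 4) * 7)  # 21
```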
So that drives retrieval of patterns of procedure?<p>I mean - like, for arithmetic?
thanks