How LLMs work

(0xkato.xyz)

822 points by 0xkato3 days ago

35 comments

malwrar18 hours ago
Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step in the blackboard. I’d puzzle about each step along the way, and the triumph of completing the drawing was also that of this sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
- ekunazanu14 hours ago
 > This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilitiesBasically, the bitter lesson: <a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf" rel="nofollow">https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...</a>
 - williamstein6 hours ago
 This interview <a href="https://youtu.be/oWOz2htozfI?si=qdQ0uZRoZOYeThOn" rel="nofollow">https://youtu.be/oWOz2htozfI?si=qdQ0uZRoZOYeThOn</a> from 2 days ago with a top researcher from OpenAI directly addresses the bitter lesson argument and the importance of scaling for the history of their models.
 - jochembrouwer1 hour ago
 So the take-away here is that we (as humans) try to model these AIs like humans, but eventually these AIs get better. Which to me seems like a logical conclusion if they can do "things" (like "learning" or pattern matching) much faster than we can (the compute). Then language in LLMs is a bottleneck, the AI is constrained by the language, and thus if we want to scale further we could let AI create its own language (we would then have to translate whatever it creates back to a language we understand). It is the same for instance if we check the language of Inuit (people who live in the north and make temorary shelters like igloos in the snow) they have multiple words/verbs to describe the snow, while in English we only have one (?): snow. In English we don't need more words (we can explain snow state using multiple words) but for the Inuit language it makes sense to create these new terms (would also make it easier and faster to communicate). So in some sense, all languages are then "newspeak" to whatever a general language is what researches or AI might come up with. If this sounds dumb let me know, but if you know some research in this general language direction (I'd assume general AI research) would love to see it!
 - xnx6 hours ago
 Isn't the bitter lesson basically the same as "The Unreasonable Effectiveness of Data" from 2009?
 - swyx3 hours ago
 not exactly, bitter lesson is one meta-level up from "scale eats everything". this is a common misunderstanding of bitter lesson that rich sutton has been fighting ever since the thing was written. in rich's own words[1], the modern summary is> Don’t be distracted by human knowledge, as AI has been historically.> Instead focus on methods for creating knowledge that scale with computation, like search and learning.so the lesson is choose methods that scale with computation, not just that blindly scaling up anything (data, params, people, whatever) works, it is choosing the right x axis and the right scaling laws consistently wins out in the long run despite short term wins from other methods.1: <a href="https://x.com/RichardSSutton/status/2056419165502935198" rel="nofollow">https://x.com/RichardSSutton/status/2056419165502935198</a>
- forestsitter6 hours ago
 Same. I recall reading a paper by Stephen Wolfram after ChatGPT came out where he goes over how it works and what it does. Such a good piece and really got me going with this stuff. <a href="https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/" rel="nofollow">https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...</a>
- jfim17 hours ago
 Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.
 - gobdovan15 hours ago
 The secret sauce is also having the necessary 'creativity' to not get ceased and desisted into oblivion and jail from all the copyrighted material you trained your model on. Btw, not making a moral judgement, [0] shows Michael and Dalton from YC discussing why Ilya Sutskever had to leave Google to pursue what's now ChatGPT[0] <a href="https://youtu.be/E8pvgN1j-Ck?t=748" rel="nofollow">https://youtu.be/E8pvgN1j-Ck?t=748</a>
 - root-parent8 hours ago
 There is a whole moral judgement to be made here...lets hope Ilya wont get too pissed off if somebody leaks the work of his new initiative...information wants to be free and all that...Also would love to know if the same Legal team advised on Gemini...
 - someguyiguess8 hours ago
 And to make anyone who threatened to expose them “commit suicide”
 - miltonlost7 hours ago
 He's a massive massive thief that people who have stolen far less from a convenience store have gone to prison for. The man is a villain.
 - achrono16 hours ago
 How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.
 - HarHarVeryFunny8 hours ago
 We know for sure the architecture of the open weights models since llama.cpp understands the architecture it needs to build to plug the weights into to run them. It's always possible that the latest closed model is doing something architecturally different than the open weights ones we know about, but judging by how close the large open weight models such as DeepSeek are to SOTA performance, this seems unlikely. When OpenAI first came out with their near-mythical "Strawberry" (aka "o1") thinking model there was all sorts of speculation that they had made some sort of architectural breakthough, but then DeepSeek replicated the capability and published how they did it, proving that it was just better training, not any architectural change.There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.
 - gobdovan15 hours ago
 DeepSeek research:- V3 <a href="https://arxiv.org/abs/2412.19437" rel="nofollow">https://arxiv.org/abs/2412.19437</a>- V2 <a href="https://arxiv.org/abs/2405.04434" rel="nofollow">https://arxiv.org/abs/2405.04434</a>- R1 <a href="https://arxiv.org/abs/2501.12948" rel="nofollow">https://arxiv.org/abs/2501.12948</a> (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's noone cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisable scaled up transformers, but most new models come with a few tricks, but we're at the point where those usually are not an architectural rewrite and added to solve an explicit problem, like hallucination, not for big new capability gains.
 - swyx3 hours ago
 > If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rarec.f. hardware lotter <a href="https://arxiv.org/abs/2009.06489" rel="nofollow">https://arxiv.org/abs/2009.06489</a>
 - matusp15 hours ago
 There are thousands of people working in top level labs. Somebody would leak it
 - ai_slop_hater16 hours ago
 No they are clearly not just scaled up versions of gpt 2; there are different LLM architectures like mixture of experts etc that appeared relatively recently. I am not an expert though, far from it.
 - otabdeveloper416 hours ago
 MoE and such are basically performance enhancements, they don't make the model smarter.
 fizx1 hour ago
 Performance enhancements are what allow you to train a bigger model.
 yababa_y15 hours ago
 separately trained experts can surpass performance in their activated regime and DOES result in a smarter model, the Claude system cards talk about this and eg there is <a href="https://openreview.net/forum?id=iydmH9boLb" rel="nofollow">https://openreview.net/forum?id=iydmH9boLb</a> to read...
 jmalicki9 hours ago
 Performance enhancements are huge though.If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.
 TheHalfDeafChef9 hours ago
 Not really “smarter” though? It’s just a big probability engine.(Not trying to flame bait or anything. I just wouldn’t call LLM as exhibiting intelligence. It is great at making connections based on probability but doesn’t have a semantic understanding of what it is doing)
 otabdeveloper48 hours ago
 > to then make your model bigger, which then makes it smarterThere's diminishing returns and at some point making a model bigger makes it dumber.
 - locknitpicker7 hours ago
 > The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around them.Without agent features, you have just a chatbot.
 - galaxyLogic2 hours ago
 The big breakthrough is we can interact with the agents using natural language - because of the LLM.It is the combination of LLM and agent-harnesses that make it look really smart. Agent-harness is a programmatic device that lets us tap into the vast knowledge in the LLM.It is probabaly true that many TV-commentators fail to appreciate this fact and therefore think LLMs are super-intelligent. No, it is the combination of LLM and the programmatic agent-haness that is the breakthrough.An interesting thought is that the LLM could in theory code the agent-harrness, start it running every time we interact with it. Currently the agent-harrness I think is pretty static I think. In theory it could be dynamically created for every task. Would that make it better don't know.
- antirez14 hours ago
 There is a different way to look at this: that is, actually the Transformer is a minimal complication of what the based model is: in theory the neural network could be just a huge FFN, which is anyway the part of the Transformer that does the heavy lifting. But this would be impossibile to train both numerically and computationally, so the Transformer encodes enough priors for it to work: the causal attention, and the math tricks like the residuals and so forth. But the bottom line of all this is that the Transformer works because of the incredible semantical power of simple/huge FFNs.
 - dist-epoch11 hours ago
 Isn't that over-simplifying it a bit too much?You can go another step - a FFN can be simulated on a Turing machine, thus it just exemplifies the incredible semantical power of the Turing machine model of computation. (in fact you don't even need a Turing machine, since there is no looping in one forward pass).In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"
 - musebox3510 hours ago
 I was about to post your last point / quote. Going multigpu is relatively not so though but once you go multi-node you have distributed storage/io/compute system which is highly non trivial. Add that the long training times now you have robustness/fault-tolerantness concerns with hardware failures and restarts. Today’s training systems are engineering marvels.
 - zbendefy11 hours ago
 Good point!There is also the case for Markov chains being theoretically able to do these if tuned well. Or even SAT problem.
 - CGMthrowaway7 hours ago
 "LLM is just fancy autocomplete"
 galaxyLogic2 hours ago
 LLM is an Oracle
 - slickytail12 hours ago
 [dead]
- crossroadsguy14 hours ago
 What hopes/paths does a mere CS bachelor (not deep into stats/maths), and mid level dev (native mobile only; 10-15 years exp.), have about not only understanding it (maybe not fully) but getting possibly into this as a career? Not expecting churning out models and AI systems from the first weeks/months but entry/employment into this field?(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)
 - 2muchcoffeeman12 hours ago
 I think you need to ask what you actually want to do with the AI.If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.
 - bluerooibos10 hours ago
 > I think you need to ask what you actually want to do with the AI.What about improving the efficiency of token consumption, etc., basically opportunities for improving cost/performance?I keep thinking there has to be a better way to share context with models than dumping entire gigantic skill files of raw text or otherwise into them - I'm betting there's a bunch of low-hanging fruit there.
 - coliveira9 hours ago
 There may be some low hanging fruit, but they're not available to people without deep understanding of how the math works. Well paid people already spend a lot of time thinking about this.
 la_fayette5 hours ago
 i am not sure acctually of the math is acctually that complicated/important. the math around neural networks is calculus/chain rule etc and for model comparison/validation one needs statistics. the required math for e.g. understand transformers is quite accessible.
 - sirsinsalot12 hours ago
 You missed the third and most important reason to learn: fun.Which sums up HN these days.
 - malwrar5 hours ago
 Im also a mere mortal, and after putting a few years into it IMO I’d say people make it much more complicated than it actually is. I failed most of my math courses for lack of interest, but found passion later with the aforementioned SLAM stuff. I have no doubt you or any other programmer could learn this stuff, especially since you can ask ChatGPT clarifying questions.I have no idea about careers at this point, I’m still doing fancy IT work as my day job I and look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows maybe there’s multimillion pay packages for anyone who can articulate how attention works in an interview.
 - LatencyKills9 hours ago
 I have a BS in CS (and have been in the field for 25 years). I couldn't understand the transformer architecture until I built a few myself. Here are the books I worked through. I now feel I have a very good understanding of modern LLMs.<a href="https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/B0DNR6TH6X" rel="nofollow">https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...</a><a href="https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandekar-ebook/dp/B0GJ75VLPS" rel="nofollow">https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...</a>
- 10GBps17 hours ago
 Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.
 - ctolsen16 hours ago
 No, it’s definitely not what a human brain is. That makes very little sense. The ways we interact with language (and thus conceptual memory) is completely and fundamentally different.
 - rfv672315 hours ago
 Is it different though?If we look beyond written languages which are late inventions of human civilization, oral languages are continuous and build with blocks not words.Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.
 - 0xbadcafebee3 hours ago
 Chomsky did the opposite of what you're saying. He didn't ignore spoken language. He said that human vocalization is independent of language, and that the way our brains can manipulate and use sound (a cognitive capability, not specifically an aural one) is the fundamental differentiator that allows us to make compound ideas, and our specific use of language is a byproduct.Example: a programming language's capability to produce complex software does not come from some inherent quality of language. It comes from binary. 0's and 1's, representing basic logic, and that being built on top of with an abstract "tool" called a language. If the binary logic didn't work, the language wouldn't do anything.A dolphin can make sounds, and technically has a language, but they can't manipulate or recursively compound concepts (as far as we can tell) in order to create modified ideas. If they could, they probably would have come up with vastly more advanced fishing methods than the (admittedly novel) ones they have now.
 - uoaei14 hours ago
 It is different, but there may be some universal principles that are relevant more abstractly among both cases. Of particular interest is the empirical notion that statistical models of a certain form will always tend to "average out noise" and "learn meaningful patterns" up to the capacity that those models have for representing said patterns. A parallel notion to this is the hypothesis dubbed "thermodynamic origins of life". The universal principle binding these two seemingly disparate topics is one that seems to underlie any sense of "learning" in physical systems: that semantics of those systems depend on their representational power, and the semantics they do come to represent are the results of adding up many pushes in one "direction" (phase space / state space / etc.) encoding a pattern, and adding up many random noise jiggles will cancel out but give you a first-order sense of variance of those semantic features as expressed by the environment.As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.
 - zaphirplane9 hours ago
 But … how close a simulation is it. I can see why people are wondering
 - redox9913 hours ago
 In the 90s you didn't have norm layers, residuals, attention, and some more.So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
 - sirsinsalot12 hours ago
 I think the attention mechanism is so simple but so revolutionary that people forget it.Like the best leaps in thinking, once it is made, is is immediately obvious and intuitive.
 - redox9910 hours ago
 Almost everything in ML is like that. It seems so obvious in hindsight. It's maybe what I love most.Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?
 sirsinsalot7 hours ago
 I suspect it was considered many times, but the sheer computation scale would make it feel like obscene brute force. It feels like the right shape but too wild to think about implementing.I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.
 redox993 hours ago
 People definitely wanted to train deep networks before, but didn't know how. They evdn tried things like training layers independently.I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.
 - bonoboTP16 hours ago
 Attention layers were not used in the 90s.
 - spacebacon15 hours ago
 LLMs are semiotic infrastructure. You won’t find a better analogy. The cognitive frame won’t hold.
 - otabdeveloper416 hours ago
 > I mean a brain is not just neurons with simple connections to each other.No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.Clearly "neurons" is an oversimplification just-so story, not a scientific theory.
 - adammarples13 hours ago
 Apparently even single-celled protozoa can show learned trial and error behaviour.
 - formerly_proven14 hours ago
 Do you consider fungi animals or do you perhaps mean animals that don't have a brain/CNS?
 - otabdeveloper48 hours ago
 Yes, protozoans don't have brains and yet they exhibit complex behavior.
 - foxes17 hours ago
 Probably better to not simply reduce it by just saying X is Y then if it has all that extra complexity and capacity.
- root-parent8 hours ago
 I had the same reaction as you, when I learned in detail, how all this works. But then I also learned about superposition and compressed sensing, and now...I am not so sure anymore..."Beating Nyquist with Compressed Sensing" - <a href="https://youtu.be/A8W1I3mtjp8" rel="nofollow">https://youtu.be/A8W1I3mtjp8</a>
- wuschel16 hours ago
 Could you perhaps cite the core papers for LLMs beyond „Attention is all you need“?
 - sigmoid1016 hours ago
 "Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture.
 - redox9913 hours ago
 Chain of Thought was kind of an obvious solution that everybody knew was necessary by the time chatgpt / gpt4 came out. It was just a matter of time that frontier labs actually shipped it.MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.
 - blackbear_15 hours ago
 The GPT3 paper is a good starting pointLanguage Models are Few-Shot Learners <a href="https://arxiv.org/abs/2005.14165" rel="nofollow">https://arxiv.org/abs/2005.14165</a>I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things workDeepSeek-V3.2: Pushing the Frontier of Open Large Language Models <a href="https://arxiv.org/abs/2512.02556" rel="nofollow">https://arxiv.org/abs/2512.02556</a>GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models <a href="https://arxiv.org/abs/2508.06471" rel="nofollow">https://arxiv.org/abs/2508.06471</a>
 - sharma-arjun15 hours ago
 Not a core paper, but I found Formal Algorithms for Transformers [1] (a Google paper from 2022) to have a great pedagogical style.[1] <a href="https://arxiv.org/abs/2207.09238" rel="nofollow">https://arxiv.org/abs/2207.09238</a>
 - barrenko15 hours ago
 I'll add in here <a href="https://web.stanford.edu/~jurafsky/slp3/" rel="nofollow">https://web.stanford.edu/~jurafsky/slp3/</a>, "Speech and Language Processing", with chapters that deal specifically with LLMs and transformers.
- GardenLetter2714 hours ago
 It's not just the architecture but also the data - the decoder only approach lets you train in parallel over blocks of text (no RNN serial waiting), that allows you train on much, much more data.
- sesm12 hours ago
 I would argue that those are not emergent property of the model, but a property of how humans find insights in a plausible guess.
- darksim90517 hours ago
 For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLM works and the tokenization part.Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.
 - cloche6 hours ago
 > this is actually a great video overview of how LLM works and the tokenization partDid you mean to link to the video? I would be interested.
- bluerooibos10 hours ago
 Since you spent a month digging into this, can you recommend any materials/projects to look into to get a decent grasp of how they work?
 - malwrar5 hours ago
 I’d recommend my method of just drawing out the block diagram and drawing out + digging into the math at each step! I’m the kind of person who needs to take time to ask lots of questions before stuff clicks, and if you are too I strongly recommend it.I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.Same happened when I tried learning this GPT stuff. karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.
 - LatencyKills9 hours ago
 Not OP but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1] books.I don't think there is anything in a transformer I couldn't explain in the smallest detail now.[0]: <a href="https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/B0DNR6TH6X" rel="nofollow">https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...</a>[1]: <a href="https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandekar-ebook/dp/B0GJ75VLPS" rel="nofollow">https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...</a>
 - hackinthebochs9 hours ago
 >I don't think there is anything in a transformer I couldn't explain in the smallest detail now.If you're up for it I would love to know how and why positional encodings work
 - root-parent8 hours ago
 Learn about superposition and then you will see nobody really know why this stuff works. Its actually a good interview question to set the bar....
 - LatencyKills9 hours ago
 Well, as I suggested, working through the implementation yourself will give you that intuition. That said, I think the simplest way to explain why positional encodings are useful is that it gives the transformer just enough information to make attention meaningful without negatively impacting any parallel, content-based comparisons.A vanilla self-attention layer is just a set of token vectors. Without positional info, swapping two identical embeddings changes very little about what attention can compute. We can "fix" this problem by using positional encodings. Text that has meaning isn't just a set of characters; the location and order of those characters is what provides meaning.
- pkoird17 hours ago
 aka "the bitter lesson"
- Gmolomo14 hours ago
 Sooooo just because you are able to understand it, it's not worth anything?It doesn't has any impact?Ah wait it does. Mh weird.Why are you not creating a startup and get rich?
 - sarjann14 hours ago
 I mean there is a little something called compute. And other complexity that comes like writing code to efficiently distribute a model across machines.
- dominotw9 hours ago
 > Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step in the blackboard. I’d puzzle about each step along the way, and the triumph of completing the drawing was also that of this sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.how did you know about the steps and there was math involved. i am curious about your process and you came up with what exactly to learn to unravel the mystery.
- coliveira9 hours ago
 Don't forget the stolen data from books and papers. You'll never get anything intelligent without using the stolen data they had access to.
 - giardini5 hours ago
 WTF?
- golergka13 hours ago
 After building some toy LLMs on my own I came to realise that architecture is not the hard part. Train is.
 - dist-epoch11 hours ago
 That's easy to say AFTER you know the architecture.Einstein special relativity is taught these days in high-schools. Doesn't mean it wasn't the very hard part at some point in time.As they say, shoulders of giants.
- faurroar17 hours ago
 Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.
 - jumploops17 hours ago
 Those are all just optimizations.We still don’t really know why they work, we just know how to build them.
 - trollbridge17 hours ago
 We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first word was to actually imitate a cow mooing, and then after that to imitate a motor noise of a tractor or truck. And then after that a meow. (His first complete sentence was "King Graham fell"...)My next child took a completely different path to language, including skipping all the non-verbal imitations.And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.
 - jumploops16 hours ago
 Completely agree!It’s interesting to me how similar attempting to understand LLMs is to neuroscience.“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”We’re basically just probing around and trying to reverse engineer an emergent system.To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.My point was simply that we don’t know why they work, and all the extra optimizations isn’t the “thing” that makes it emergent.Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.(On a side note, what other architectures can we scale to find similar emergent behavior?)
 galaxyLogic2 hours ago
 Isn't the LLM simply predicting what should be the next sentences after user's input, using its algorithm and data it has exatrcted from existing texts on the internet. The algorithm that does that could have many different designs, some better some worse for the purpose of predicting what output makes most sense next?So what is it that we don't understand about why theyr work? The algorithm? We have the code. Why the specific algorithm makes such good predictions? I see it as a generalization of trying to predict who wins Kentucky Derby.
 trollbridge12 hours ago
 Computer vision ends up displaying emergent behaviour. It just "figures out" things.
 - ai_slop_hater16 hours ago
 Human brain capabilities are truly amazing, imagine if people didn’t treat their children as if they are stupid and didn’t constantly lie to them, because kids are stupid right, they wouldn’t understand. What heights could be reached.
 baq16 hours ago
 We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.
 Izkata3 hours ago
 Another example, my parents taught me to read at about 4 years old. When I started kindergarten (the year before 1st grade in the US), the teachers and principal didn't believe I could read and I had to prove it by reading a book to them I'd never seen before.I think they're right that kids (at least in the US) are generally treated as less capable than they are, and it ends up slightly delaying their development.
 ai_slop_hater15 hours ago
 You may have been raised properly since you don’t get what I mean. I really envy kids with “Chinese parents” that had them learn math early on and not some bullshit like that if you put your tooth under your pillow, then a tooth fairy will come.
 mejutoco15 hours ago
 I think those 2 are orthogonal. Math still works with Santa or the tooth fairy.
 ai_slop_hater15 hours ago
 Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.
 beezlewax15 hours ago
 It is possible to have learned both things you know.
 skydhash10 hours ago
 I had to learn maths early (not chinese or asian) and also a bunch of scary stories to make me behave. I would have been glad to learn about fairies.
 trollbridge12 hours ago
 They aren't stupid, but they aren't quite ready to handle the full responsibilities of the world and worry about things they don't need to worry about.My son is very worried about black holes lately when he learned anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.It was also fun to hear Hawking speak. He wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.
 pmg10115 hours ago
 Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.
 ai_slop_hater15 hours ago
 Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.
 skydhash10 hours ago
 Childhood only lasts 13 to 15 years where I am. By the time you’re in high school, you can be expected to be responsible in some matters. By 22 you have 7 years of experience in making decisions for yourself.
 - slopinthebag16 hours ago
 Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.
 - malwrar5 hours ago
 It was precisely that for me! Another commenter captures it well; “the bitter lesson” indeed.
 - otabdeveloper416 hours ago
 We do know how they work. They predict the next statistically most likely token.The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)
 - klempner9 hours ago
 Sufficiently good iterated next token prediction is an AI hard problem.
 - throw31082215 hours ago
 > statistically most likely token.Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.
 skydhash10 hours ago
 It’s not unknown because that’s what the model computes. It’s matrix multiplication just like shaders.
 throw31082210 hours ago
 And how do you know that the model computes it correctly?
 skydhash9 hours ago
 Correctness is based on axioms and rules. You need to define your axioms and rules first before you can determine correctness.If you’re talking about matrix multiplication, I can use mathematical rules and axioms and proves formally that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.But things like grammar correctness, or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without loosing a particular emergent properties (like dithering or psycho acoustic audio compression)
 - perching_aix8 hours ago
 This "they just predict the next statistically most likely token" is such an handwavey and willfully misleading explanation, it's unreal, and I'm so fucking tired of seeing it so incessantly repeated. It's beyond asinine.You know it perfectly damn well that a typical person's idea of statistics is not some insanely high cardinality stateful prediction, but a "well a coin toss is a 50:50, and a lottery win is a 1:100000000". You also know it perfectly damn well that as a result, people will just think that all the sentences chatbots ever produced to them were then just somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that's not at all inquisitive to how exactly does it do it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they do also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.It's like summarizing how any program works by just saying "well it just manipulates ones and zeroes". Not very informative, is it? Or how programs are written by just programmers sitting in a cushy office, ryhtmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming in your life on your own. Extends to all other white collar jobs too.It's also not even true in the most literal sense: models can and do absolutely choose a less than maximally likely next token, that's what the various decoding parameters are for. "Maximally likely next token" further conviently skipping over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.I'm so over this "stochastic parrot" bullshit.
 stevenhuang1 hour ago
 I don't even try anymore. The people who still parrot the stochastic parrot bit this late in the game will simply never understand it.
 otabdeveloper41 hour ago
 LLMs predict next token one at a time. (Stochastically.) Literally. It's what they do. That's how they literally work.If you don't believe me, download llama.cpp and see for yourself.P.S. I write inference backends in C++ every day. The gall of people like you who figured out how to prompt Claude and think they're hot shit now is simply unbelievable.
 stevenhuang3 minutes ago
 I help write optimized CUDA kernels for proprietary hardware. They may "literally" work this way, but that is quite besides the point.If you don't see why then you have exactly demonstrated my point in how practitioners like you are lacking the foundational understanding in philosophy, information theory, human consciousness, human cognition necessary to bridge this conceptual gap.
- lowken1013 hours ago
 [dead]
- firemelt13 hours ago
 fucking well said
- robwwilliams8 hours ago
 Great, and won’t we all be just as surprised when human self-attentional control turns out to be just as simple or just as complex! Our minds as a strange fabric built of threads of recursions without the benefit of any explicit clock.
miki1232118 hours ago
There's one thing I wish people understood about LLMs, and it doesn't really have anything to do with what's inside the neural network part. It's the fact that LLMs can only write in one direction — forward.When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that, the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything, so had no reason to "save face" and lie.This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.
- swyx3 hours ago
 agree reasoning fixed a lot of it. but the inherent path dependence of autoregression is one reason i was excited for text diffusion models (<a href="https://www.youtube.com/watch?v=r305-aQTaU0" rel="nofollow">https://www.youtube.com/watch?v=r305-aQTaU0</a>)instead of going left to right, even with a scratchpad, maybe you start with a rough shape of the big picture all at once, and then you iteratively resolve and things come into focus.mercury (<a href="https://www.youtube.com/watch?v=2fDBeMu6xjk" rel="nofollow">https://www.youtube.com/watch?v=2fDBeMu6xjk</a>) seems to have made the most progress here, which is not saying a ton but is not nothing. i do think it is telling that of the big labs, only GDM has made any meaningful bet on text diffusion. you can bet your ass all of them have evaluated it for a source of alpha.
- chris_money2026 hours ago
 In terms of our brains though we can only think forward as well (if forward is time). Our brain in the future says something we did in the past was wrong (part of the sentence we wrote) and that informs our body (the agent) to go back and fix it
 - kzrdude4 hours ago
 I think that's why pen and paper is such a good tool for thinking. :)
- FergusArgyll1 hour ago
 I get what you're saying but to be slightly pedantic etc.Why can't an llm tool call `delete(index: int)` or `replace(from_index: int, to_index: int, string: str)` and then it can go back and edit just the way we can?We also first made the mistake and only afterwards noticed we actually want to change something
helloplanets14 hours ago
The part about positional encoding is not correct.> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its positionYou can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). You rotate each token's Query and Key vectors only, so dot product can be used to tell how far apart the tokens are when comparing token 1's Query vector to token 2's Key vector.Positional embedding should just be explained after explaining the Query, Key and Value vectors. When the article explains those only after that, the reader is building up on a wrong intuition and it gets confusing.
- rhubarbtree2 hours ago
 Yep, you’re correct. I got to that bit and thought that can’t be right. It’s obviously wrong. If you rotate a semantic vector, you change the semantics of it. You don’t want that.Makes me wonder if the whole thing is just slop.Is the rest of the article correct?Anyone suggest an alternative article?
- giardini5 hours ago
 Could you restate this another way: I don't follow.
10GBps18 hours ago
I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
- trollbridge17 hours ago
 Comment <-> username synergy.
- helloplanets12 hours ago
 It's basically possible build an LLM using just routers+packets, and then hook them up to Wireshark to see it compute!
- Maledictus15 hours ago
 How would I set this up?
 - barrenko15 hours ago
 I'd recommend to maybe also specifically watching Karpathy's videos and focusing on the early parts where he specifically deals with tokenization / embeddings generation (which gets really overlooked), and he does this in most of his videos.
- fragmede16 hours ago
 <a href="https://distill.pub/2019/activation-atlas/" rel="nofollow">https://distill.pub/2019/activation-atlas/</a>I can only imagine what sort of visualizations are going on today inside of the AI labs.
alecco11 hours ago
A better blog on Transformers: <a href="https://www.aleksagordic.com/blog/transformer" rel="nofollow">https://www.aleksagordic.com/blog/transformer</a>
vocram15 hours ago
Saying an article is of inferior quality just because editing was AI-assisted is like saying a book is lower quality just because it was printed rather than written by hand
- hiworld65432 hours ago
 Saying the article was AI assisted is also wrong. Current consumer models can read a vehicle user manual in moments and indicate probable writing errors. This article had a few stylistic errors (or choices) that were irrelevant, except to prove the author is human. If he/she had used AI even in an assistive manner, knowing the nitpicking behavior of the intended audience, it would have had none.Also, the author’s other public writings have similar errors/choices in style. When “consumer AI” writes or rewrites, it’s impossible for it to mimic one’s writing style so similarly. Literally impossible, because it can’t disregard everything else that it “knows” (voluminous training, guardrails, interface design, social boundaries) for it to 1. Disregard all that training after processing the user’s prompt 2. output a complete article in the user’s style 3. Turn back on its knowledge 4. then continue to function. That’s just not how consumer products work.
- Ampersander13 hours ago
 You are exactly right! People do not find the writing obnoxious, they are backwards technophobes getting brought down by their superstitions.
- bspammer14 hours ago
 No? One affects the actual text and the other doesn’t.
- lateral_cloud14 hours ago
 AI assisted is a stretch. And that analogy isn't even close to being relevant
- possibleworlds9 hours ago
 This analogy makes absolutely no sense.
- Laurel123414 hours ago
 Rather interesting than clanker slop defenders downplay the clanker aspect and highlight the human by calling it "ai-assisted", which defeats their entire point.I hope you do some introspection and start consciously recognizing that the human input and the clanker slop is just debasing it.
- janalsncm14 hours ago
 Not just that, I think a lot of people are going to waste their time losing the battle (and make no mistake, they will lose) fighting against AI writing without ever asking themselves what makes writing good in the first place.There’s good AI writing and bad organic writing. But it’s easier to point out a few LLM-isms than to actually identify the problems with text.
 - blharr8 hours ago
 > There's good AI writingSure, but the LLM-isms in AI writing are mentally exhausting to see in every way at this point.The whole point of reading, frankly, is to understand the voice of other people. When you pass that through a distorted filter that makes everyone sound the same... its bad, lossy, frustrating communicationIt's also dishonest. When you publish something that is direct output without your wording. Digital catfishing at best.The only good AI writing is providing the prompt, because the question is way more interesting, and way more constructive to learning than the answer
 - janalsncm2 hours ago
 The point of writing is to convey an idea to another person or yourself at a future date. Authenticity has nothing to do with it. I frankly do not care about the “authentic voice” of the author of a random blog. I want to know if they have any interesting ideas.
 - wj4 minutes ago
 I think because so much of an idea is shaped by the language used to convey it, it may be hard to separate the person from the LLM.I think gp may want to know if a <person> has an interesting idea rather than <person + llm>.
andai19 hours ago
I couldn't load the article directly due to an SSL issue, so here's the archive link:<a href="https://archive.ph/aWtFG" rel="nofollow">https://archive.ph/aWtFG</a>
metaquestions2 hours ago
Thanks for putting this together. I found it very helpful
oceansky6 hours ago
Out of curiosity, I wondered if you could break a tokenizer by introducing weird characters not mapped to an id.But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
AltruisticGapHN14 hours ago
I don't like how most LLM explainer articles and videos say that essentially a LLM " predicts the next word".I'm a developer but not very good at maths and I still don't understand any of it.A LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Like recently I waanted a checkbox that has like a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox, and overlay that with a small inner div so you only see the edge that looks like the gradient is circling around the checkbox.How is that "predicting the next word"?Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".What I mean, is the LLM is able to represent things in space . That part I don't understand.I also still dont understand the relationship between the chat based LLM and the multi modal stuff. I think I read somewhere when image is generated it is also tokens?
- dev_hugepages13 hours ago
 Predicting a word is the final objective, as in the output of the model is a probability distribution of the next token. However, choosing the right token is more complicated than just regurgitating the training data (and you won't encounter an exact example in the training data, so you need to interpolate). This makes the model learn abstract representation of things that it is able to manipulate before outputting this back into token. RL also complicates this because the "fitness" is now some arbitrary metric computed over an entire sequence of tokens.
- Borealid13 hours ago
 Your casual understanding is imprecise.At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.
 - hackinthebochs9 hours ago
 >At all times the LLM is, indeed, predicting the next tokenThe point is that saying they're just "predicting the next token" is not at all explanatory nor providing insight. Saying the brain is just firing action potentials gives you no understanding about how the brain does what it does or what the space of its capabilities are. Similarly, predicting the next token tells you nothing about the capabilities of LLMs.
 - galaxyLogic2 hours ago
 True, but that is a great fact to start from, and understand.Then the next question becomes "HOW do they predict the next token?" There are many ways that can be done, why is this particular algorithm so GOOD?"When people say "We don't understand how LLM works" isn't it really saying we don't understand how this specific algorithm used to predict the next token works? No, it is not, because "we" do understand how all those algorithms work there are many descriptions of them available.So the question then really is "Why is the prediction this algorithm makes, so good, as compared to some other statistical algorithms?"It's not about "Why does AI work so well?". It should be "Why does this particular XYZ algorithm work so well?"
 - layla5alive12 hours ago
 Lol, the bird did not 'fly' - it just flapped its wings and generated lift!
 - Borealid9 hours ago
 No. The how is relevant here because it leads to understanding of the resulting behavior.If you train the LLM on a corpus that shows people saying the sky is red, you get an LLM that is predisposed to say the sky is red. This is true even if it's also trained on all of the science that explains how and why the sky is blue.If it were to "figure out" or "reason", it would not have such a predisposition to emit "red" after "the sky is" just because that matches the reward during training.In other words, the token prediction is important because it both explains the successes AND the failures of the LLM. If there were situations in which a bird could fail to fly, then how it tried to fly would also be crucial knowledge.
 - layla5alive3 hours ago
 You can also teach humans science and math and then they can be trained by a cult to not use any of that reasoning when emitting canned responses that they were rewarded by the cult for internalizing during their training. "Fake News!"You're caught up on the mechanics of token processing (floating point matrix ALU math) and ignoring the context that p(next token) as a function being "computed" is doing so over a trillion parameters. You can poorly train a model, sure, but assuming you don't indoctrinate it too much, properties like cognition emerge - it learns to reason; why? Reasoning is more efficient and compact than memorizing answers.
 Borealid2 hours ago
 I completely agree that humans sometimes are not applying reasoning to things.I'm not trying to argue a model cannot "reason" or have "cognition", whatever those things are. I'm only saying that it's absolutely the case that whatever those things are, they come from its mechanism of predicting one token at a time ad infinitum, and that throwing away a deep understanding in favor of a shallow one is foolish. Just because it might seem to be "reasoning" does not mean it IS doing so, and certainly giving the appears of reasoning does not mean it is NOT a token predictor.If I knew deeply how the human brain works I would use that understanding instead of saying things like "this person reasons" or "this person thinks".In summary, I'm not "caught up in" anything - I'm just trying to point out that the original poster here is incorrect in saying that clearly LLMs aren't working through token prediction. They are, and all their behavior is 100% explained by token prediction. That's more than enough for interesting behavior!
 - qsera12 hours ago
 More like being suspended by a thread...
- antran2213 hours ago
 It's still predicting the next word. Somewhere in the gigantic dataset that the LLM was trained on, there is a phrase that says "gradient border" being in the vicinity of a CSS code that render the stuff. Therefore when you run it on an inference loop there's a good chance it output that CSS code when you tell it to render a "gradient border"Multi-modal models that can understand visual input do exists, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.I'm not dismissing the capability of "predicting the next word" however. The vast amount of training data enable extremely complex and useful behavior you just described.
 - 360MustangScope12 hours ago
 What about things it wasn’t trained on?For instance I’ve written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. that it had never seen before and then just gave it the syntax which is different than what it had ever seen before. Simply due to the fact I made it and it had never been trained on it.After giving it my documentation, it was able to write the language just like a language that it had been trained on. I’ve also seen this behavior at work where there are weird quirks to do things and definitely not standard and it can handle it.
 - qsera11 hours ago
 Because in its training data there is information on how to map from documentation of a language to actual programs. This means that following the pattern it can map between documentation for any language to programs in that language.But I think it will have difficulty in crossing paradigm boundaries, by simply using documentation.
 - skydhash10 hours ago
 That’s because it does not encode words or keywords or anything like that. It encodes their relationships. A formal language like a programming language are pretty compact. There’s not much variation between the C-like languages. Just like most Lisps (clojure, scheme, elisp, racket) are fairly similar to each other.The exact syntax does not matter, only the grammar. If you give it the grammar, and then the keywords, it can find something that has similar grammar and then use your keywords.
 - YeGoblynQueenne8 hours ago
 I'd be very careful assuming something is not in an LLM's training set. Those data sets are truly vast. And, from experience, people tend to miss a lot of their content.As a for instance, back in the day some academics wrote a paper that compared GPT 3.5 to a couple of inductive programming systems (including one of mine) on solving programming problems in a certain well-known esoteric language which I shall call "L". The task was to solve those programming problems one-shot. The authors asserted that the "L" problem sets were unlikely to be in 3.5's training set, but I found them without much search in a public github repo. I mean the entire dataset was right there. In this case the researchers are colleagues and friends and I know they weren't simply negligent or malicious, they just missed the fact that their "unlikely to be in the training set" data was on the web.So I'd always assume that if an LLM can perform a task that's because it's seen examples of the task during its training.Without forgetting that LLMs have this really shockingly powerful ability to interpolate between examples and they can improve their performance on say Task A by training on Task B, where A and B are different but similar.e.g. they seem to get better at translating between language pairs of which they have few examples of parallel text by training on other pairs of languages for which they have more parallel text; they seem to learn something about language translation in general by training on more examples of translation. I haven't got a good reference on that handy but it's well-known (and of course over-hyped and exaggerated by tech CEOs).So without wanting to diminish your work, I'd guess that your new language's syntax is different and novel but everything else about it is more ordinary and the similarities are such that an LLM can wing it and write you a lexer etc. After all, the whole point about parser generators and similar tools is that the task can be abstracted and separated from syntax in the first place.In fact LLMs are very good at that sort of thing, filling in the blanks as it were. I'm old enough to remember the excitement about GPT 3.5 being able to form syntactically correct sentences with nonsensical words give to it.For example, I just asked Chat [1]:<pre><code> Hey chat. The gostak distims the doshes. What happens to the doshes? </code></pre> And it promptly answered:<pre><code> The doshes get distimmed. </code></pre> See, it even got the spelling right!_________________[1] <a href="https://chatgpt.com/c/6a242b65-e248-83ed-9a6e-f238a1e871b6" rel="nofollow">https://chatgpt.com/c/6a242b65-e248-83ed-9a6e-f238a1e871b6</a>
 - layla5alive3 hours ago
 I could answer the same query the same way as a child.
 - layla5alive3 hours ago
 I do not think you are correct.
- Marha0111 hours ago
 LLMs fundamentally work by predicting the next word (token). But that should not be used to diminish their potential capabilities. It's like saying that human brains "just predict (or produce) the next electrical impulse". Fundamentally correct, but says nothing about the potential emergent capabilities of scaled-up systems that work like that.Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.
 - layla5alive3 hours ago
 So much this - so many people seem to miss the forest from the trees that emergent properties are not bound to the complexity of the underlying mechanics.All of life arises (maybe) from very simple subatomic particles, and at each stage you can repeat this refrain, complexity increasing as you stack.
 - galaxyLogic2 hours ago
 Game of Life comes to mind: Most simple logic, emerging patterns are hard to believe.
- raincole8 hours ago
 > What I mean, is the LLM is able to represent things in space . That part I don't understand.Why do you think this is mutually exclusive to "LLM predicts the next token"?If you tell someone from 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason they can't.If you tell someone without math background that the sums of smaller and smaller sin waves can represent pretty much anything in our universe, they might be really confused. But there is no reason they can't.There is simply no reason that a next-token predicator can't generate a nice-looking checkbox.
 - layla5alive2 hours ago
 You're talking about simple compression and encoding mechanisms and by implication you're drawing an analogy to an LLM encoding/compressing the information..And sure, it does, but the person you're replying to was trying to understand why it also seems to reason about the query to give an answer consistent with it, despite not being trained on that query or answer. Your answer seems to imply that its just another slick complex encoding.But the emergent property of trillions of digital neurons predicting the next token is that in the process of being trained to do so, they can also learn to reason.At some scale, it is efficient to encode cognition which is capable of mimicing the cognition which generated the input tokens.
- nchie13 hours ago
 I understand that to be the "emergent abilities" which are spoken about. There are correlations in the dataset that are strong enough for it to seem to have an understanding which wasn't obvious it would have from simply "predicting the next word".
- qsera12 hours ago
 >is the LLM is able to represent things in spaceIt is imitating the text written by humans who can represent things in space.
- YeGoblynQueenne8 hours ago
 Sorry you're being downvoted for asking a very reasonable question. I don't think any of the replies here address your question either.If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also still images, video and also sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but pixels, the video part (probably) on video frames etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either, you can find systems that captioned images before the rise of LLMs simply by training on examples of images coupled to their textual descriptions.In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.
- locallost12 hours ago
 I don't want to pretend I can explain LLMs, but the same "math" can be applied for visual and non visual things. The dot product of two vectors gives you the angle between them. This is true in 2 or 3 dimensions. But it's also true in 4, 5, 6...n dimensions even though we cannot visualize a 4d space. That it's an angle is relevant for you in the space you can comprehend, but for math or a machine it works in any number of dimensions. So it does need to understand anything visually if the math checks out.
- throw31082213 hours ago
 LLMs are modelled to predict the next token, and are indeed trained to do so on enormous bodies of text. But to be really good at predicting the next token (word) at the end of a long string of text, you must understand what the text means. If I give you the entire text of a long novel and at the end ask you a single "yes/ no" question about the plot, you only need to emit a single token, but emitting the correct one implies having understood the plot of the novel. This is what LLMs do. They're generating meaningful, coherent text, which implies understanding and cognition at a level that is much deeper than that of the single token they generate at each forward pass. Internally, the LLM has learned to represent the meaning of the entire prompt text, the concepts it implies and its possible continuations far beyond the horizon of simply outputting the next token.
 - otabdeveloper48 hours ago
 > This is what LLMs do. They're generating meaningful, coherent textNo, they generate grammatically coherent text. That is because human language grammars are fundamentally mathematical structures that can be approximated with matrix operations.They don't generate meaningful text because they have no inherent knowledge of the world.If you've used LLMs for any amount of time you've already noticed how often they get confused about numeric quantities - like confusing notions of "bigger than" and "less than" or being unable to count letters in words.This is because any meaning in their output is only accidental.
- MagicMoonlight11 hours ago
 It can’t. It’s like a Redditor, it just repeats what it has seen other people say.It has read all of stackoverflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.
 - mjmsmith8 hours ago
 > It’s like a Redditor, it just repeats what it has seen other people say.Can stochastic parrots understand irony?
- Ampersander13 hours ago
 I do agree bigly. Calling what is basically a superhuman brain inside a computer just a "token predictor" is peak thinkslop.
 - otabdeveloper48 hours ago
 Inside the magic AI box is literally nothing but this loop:<pre><code> int n_tokens = 0; while (n_tokens < TOKENS_MAX) { int next_token = decode(context, ++position); print(token_to_text(next_token)); ++n_tokens; } </code></pre> If you don't believe me then just download llama.cpp and see for yourself.
 - layla5alive2 hours ago
 decode() looks simple! Wow, obviously intelligence can't live behind that function call! /sNow, take that for loop, and replace the implementation of decode(context, ++position) and pass it to a human who was bored enough to play along and use a notebook to organize their thoughts and translate them to/from this encoding (you might write a helper function to do this for the human in the front-end of the new decode() impl, but the data flow in and out of decode() will remain the same):decode(context, position){<pre><code> cached string_answer = ask_human_question_via_context(context); return decode_human_answer_to_tokens(cached string_answer, position); </code></pre> }Is the output you get not thinking anymore because it passed through this harness? Did the human's mind somehow get reduced to mere interpolation?The human mind is still a human mind. Putting a simple harness in front of a mind does not affect its fundamental properties.In an LLM, decode() is calling into a trillion parameter connectome.
 - otabdeveloper41 hour ago
 The LLM predicts next token one at a time. (Stochastically.) This is a literal truth.Deal with it.
yukIttEft13 hours ago
> so the model figures out during training what each token should look for and what it should offerBut how does it learn this token-relationship?All it has is many text samples, but still, nowhere it says how the tokens relate to each other, so where does this information come from?
- HarHarVeryFunny7 hours ago
 The model is just trying to map from sequence to next token. You could say that it doesn't really care about the relationships between words/tokens - it is just being trained to learn the best attention/etc weights to make this mapping as accurate as possible.The model could just as well learn to predict next token from gibberish text as long as there were some statistical gibberish regularities to learn. However, if you train it on real meaningful text then the statistical regularities it needs to learn (and will, thanks to gradient descent, and the capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of words comes from how they are used, which to a first approximation can be implemented by considering the meaning of words to be defined by the words they appear next to(!), which is what the Word2Vec embedding training algorithm does, and famous examples such as "(king - man) + woman = queen" prove that this is in fact learning the meanings of words.
- inkysigma13 hours ago
 At a high level, the text samples are how the relationships are derived. If we treat text samples as sequences of tokens, then the sequences of tokens describe the joint distributions they occur together which confers the relationship between them. Iirc, this is related to the idea of the distributional hypothesis in NLP: the idea the semantics of words should be similar if they occur in similar situations.
- dist-epoch11 hours ago
 How does evolution learn the form-fitness relationship?It's the same thing here, you randomly try various token-relationship values and the ones which are slightly better will be favoured.
- MagicMoonlight11 hours ago
 If I handed you thousands of documents which said “Jan-Michael Vincent” all over them, would you need to understand who that is in order to notice the relationship there?
zenfoxai7 hours ago
Nice article but chain of thought is what makes frontier LLMs smart, not really the token loop
- brcmthrowaway5 hours ago
  Is chain of thought same as test time compute?
whyage9 hours ago
Style nit: the transitions between dark-mode text and large diagrams with a snow white background are jarring.
spaceisballer3 hours ago
Solid read especially for someone not in this field. While everything I’ve learned about LLMs has been pretty interesting, all I can say for sure is that I’m more and more skeptical about wide scale adoption. Consumers are being pushed almost to the level of coercion to utilize LLMs. Especially in the case of the government, who in most cases will get a free pass for a year to help build that addiction before the real bill comes due. I’m sure people will object to LLMs being considered statistical bullshit generators, but I cant stop seeing that they just generate bullshit. Bullshit can be believable, sometimes bullshit is truthful. But the public is generally accepting that LLMs churn out truth, and they trust them blindly. I truly don’t see the upside, and the argument being pushed by the dealers is that we need to keep using. The real breakthrough is around the corner. It’s my believe that the tech bros are the new robber barons, they’ve moved on from crypto and NFTs and found something that is pretty impressive but a far cry from the GenAI that wish for. I just feel that technology was supposed to free us and also connect us, but instead it’s all things addictive and consuming.
- Gareth3212 hours ago
 I’m kind of amazed when I read comments like this, but I have to remind myself that I work in an industry which use these tools at the cutting edge and see what they can really do. In the space of 18 months I changed from skeptic to the belief that our world is going to RAPIDLY change, and soon.I sense that statistics and benchmarks and research and statements from the world’s greatest academics won’t sway you, so maybe I’ll give you a personal anecdote. I have suffered from a condition my whole life called bile acid malabsorption. It caused chronic diarrhoea, pain, arthritis, dehydration, insomnia, and more. I spent decades searching for an answer. Dozens of different tests. Eventually doctors just said I was depressed and prescribed me antidepressants. They didn’t help. On the bad days I considered ending my life.In desperation I turned to ChatGPT. Over months I described my symptoms, triggers, diet, timing, etc. We “sparred” with each other over assumptions and ideas. I gave it all my medical history. All the tests. Eventually it concluded that BAM was likely (plus another few options). So I pushed my doctor for a specialist referral. The specialist agreed to a scan based on the symptoms. It was confirmed. I’ve been taking some cheap medication each day now and it has changed my life.I know others for whom ChatGPT has changed their lives in similar ways. Research shows LLMs are better than doctors already in many cases at diagnosis. They are improving at an exponential rate.
 - spaceisballer1 hour ago
 I have no doubt there are similar stories to yours or to a lesser extent. And while I’m glad you have gotten some relief, what price are we paying as a society? I’m not just talking about the environmental impact alone, what about the societal impact? Social media has not been around too long but it seems to be a net negative for society. I’m sure there are plenty of anecdotes to the contrary, but the studies are showing more and more of the detrimental impact to kids/teens. The stories and testimonies of how it was engineered to be addictive. Yet we’ve mostly moved on and allowed money more social media to get owned (and now traditional media) and controlled by the rich. But your statement is right, I doubt I’ll get swayed easily. Anecdotes and proclamations work when thinking short term. But taking the long view or looking at recent history I don’t see a positive. We have the hype being brought to us by the people that gave us the dot com bubble, crypto and NFTs. But hey, go fast, break things and never ever think about if what you are doing is really beneficial to society (after us, they are not us.)
agumonkey6 hours ago
Nice intro, gonna help me dig further a lot now. Thanks a ton.
melvinroest15 hours ago
I thought Karpathy’s microgpt explain how LLMs work
- disgruntledphd215 hours ago
  Microgpt is really good, if you want to understand exactly what happens. I still thought that this article was a good, higher-level complement to that article though.
dotdev_prem4 hours ago
What is the minimum Nvidia graphics card required to train a 1B model from scratch??
stalfie14 hours ago
This article describes how Transformers work, but not really how LLMs work. Explaining the underlying architecture gives you about as much insight into how a modern LLM behaves as an breakdown of neuronal biochemistry and a few pathways does for the brain. Meaning, almost no insight at all.
cubefox15 hours ago
We are living in a crazy science fiction world where on the top of the HN frontpage there is an article on how LLMs work which is likely itself LLM generated, and the only way to tell is its writing style rather than its factual accuracy.
rishbz12 hours ago
Great insights. RL training is the key
aabdi16 hours ago
this is hard to read...it goes all over the place.i'm not actually sure who your target audience is.there's too many side tangents.just like, structure it plz.1. customer feels bad cuz they don't understand how llms work2. provide high level abstracted explanation (don't dive into concepts yet)3. provide breakdown guide of overall set of components.4. walk through each component. don't side track. no need to explain, ROPE,GQA etc... it just distracts.i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.at a high level llms take in words, do some math on them, and then produce words, one by one.inside llms have these different components. we walk through them step by step.1. tokenizer2. embedding3. attention4. heads5. ffn6. sampling## tokenizer
- barrenko15 hours ago
 It's just slop.
lhd118 hours ago
find it difficult to engage with AI generated text. What am I getting here that I couldn't get from a chatbot.
- blackoil17 hours ago
 Hopefully someone has asked right questions and removed confusing answers/hallucinations.
- dialsMavis18 hours ago
 Is this text generated by AI? I couldn't tell but I'd believe it if it was.I imagine if resources were spent writing this text then one benefit of using it is not using more resources or the pollution caused from a chatbot.
 - zemo17 hours ago
 normal people talk and write with some notion of meter, the cadence of communicating where pauses are inserted at places that naturally suit the speaker (and listener) to pause for thought. LLM's don't really do that, they just write a bunch of sentences.> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs.People don't really write like this and they don't really talk like this (and no, people don't necessarily write exactly how they talk because they don't read exactly how they listen; the written word can be backtracked while the heard cannot, and speakers/writers know this, either consciously or unconsciously). A person would probably structure this more like:> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. For example, there could be one neuron that activates strongly on Eiffel-Tower-related text, another that activates strongly on programming languages, a third neuron activating on past-tense verbs, and so on.Usually people wouldn't write "Another on programming languages." as a standalone sentence like that because the periods introduce an unnatural pause like they're giving a TED talk, unless of course they were punctuating that way for effect, but you'd essentially never communicate with that effect full time.
 - mattnewton16 hours ago
 I don’t disagree with your conclusion that this is likely ai rewritten, but I do find it strange that you say “normal people don’t write like this” when it is mimicking how people write, and using patterns I have seen people write. I think models are at the point where style is not really reliable as an indicator anymore.
 - Izkata3 hours ago
 A lot of the common patterns people ping as AI (like "it's not X, it's Y"*) are marketing-speak, of which there's a lot of on the internet. It's applying existing patterns in unusual locations, ignoring the original context.The one they're pointing out (the short punchy sentences) also apply to things like politicians and news articles. Blog posts are a bit odd.* And here I mean those literal exact words. People are also extrapolating to similar patterns that use different or more words than "it's not" and "it's", but those flow better and aren't what I'm referring to here.
 - AgentMatt16 hours ago
 I'm sure there's plenty of writing in the above style to be found on the Internet, and hence having been trained on by the LLM. I'm also not a fan of this style, and in particular I'd say it's rarely or never found in scientific / technical writing meant to convey understanding rather than sell or hype. So here it's IMO more of a style mismatch.
 - zemo10 hours ago
 It’s not a model of an author, it’s a model of documents. That’s not the same thing.
 wizzwizz43 hours ago
 No, but sufficiently-advanced overfitting would lead to to the model keeping track of an author stylistic profile, in the same way it keeps track of the plot of a story it's writing (i.e., badly, but well enough that you have to pay attention to notice that something is wrong).
 - thin_carapace15 hours ago
 people sure do write like that, in novels. nobody writes scientific articles like novels, because scientific articles don't need to maximally capture audience attention. the purpose of a scientific article is to convey information - this pursuit is not assisted by punchy prose.
 - MagicMoonlight11 hours ago
 It is trained on its own slop. They haven’t trained these models on books for three years at this point. Only on generated slop. (And RL slop upvotes/downvotes from users)
 - rippeltippel17 hours ago
 The voice of several passages resembles ChatGPT very closely.
mathisdev78 hours ago
very interesting and useful!
singpolyma319 hours ago
Next do "why LLMs work"
- inkysigma13 hours ago
 This is essentially an open research question. ML theory is unfortunately very weak relative to where the empirics are. I think there's a relatively optimistic paper that was posted a while back here but I would also take it with a grain of salt.<a href="https://arxiv.org/abs/2604.21691" rel="nofollow">https://arxiv.org/abs/2604.21691</a>There's of course empirical results and relatively weak theoretical results like the UAT but I also don't think that answers your question fully, especially since it seems impossible to definitively answer questions that the industry seems to betting on like whether or not there is a lower bound to their error rate or whether hallucination as a problem can be solved. We have much stronger ideas of what linear regression is doing relative to what LLMs are doing.
- sheeshkebab19 hours ago
 considering they work with any architecture/configuration given enough compute, just more or less efficiently - then maybe it's fundamental, in the same sense as why electricity works...
- krackers17 hours ago
 See Tegmark's "why does deep cheap learning work so well" (well not so cheap anymore...)<a href="https://www.youtube.com/watch?v=5MdSE-N0bxs" rel="nofollow">https://www.youtube.com/watch?v=5MdSE-N0bxs</a> is remarkably prescient given that it was written before LLMs
- soupspaces19 hours ago
 Universal approximation theorem, embeddings, self-attention, gradient descent. And empirically, scaling laws.
- qsera11 hours ago
 Because there are patterns everywhere!
- skydhash18 hours ago
 Why does linear regression works? Why does computer works? Because it's about math and the encoding information. If we can encode words as numbers, then why can't we encode their order as a relation? It's just that neural networks are very apt at finding that relation even if it's noisy.
lateral_cloud15 hours ago
I don't understand how these AI written articles get so many votes.
- alansaber8 hours ago
  There is a very high volume of them being posted every day, and they are a significant % of the total. Also, writing is hard, LLM articles can be slop whilst also being better written than average.
lionkor15 hours ago
It sucks that this article is clearly LLM edited, with common phrases like "same shape as", "the intuition: ", and the "tiny explainer" which clearly generalized from a prompt accidentally.Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.
- Ampersander13 hours ago
 You're not supposed to read it, just like you're not supposed to write anything anymore. Claude can read and write more than any human. We just lean back and relax now.
- janalsncm14 hours ago
 I don’t think it’s absolutely embarrassing. First of all, the point of the author writing at all is to aid understanding, not produce prose. So from that standpoint, what would be embarrassing would be to include incorrect facts that suggest a fundamental misunderstanding of the topic.From my read, it is fine. The brief history of LLMs is complicated since every single component has papers introducing enhancements. So it’s easy to ignore them or get bogged down with details.The author appears to be a security researcher learning about LLMs for the purpose of defending against common attacks. So this piece is that person giving themselves a crash course on the topic. The fact that they cleaned up their notes with an LLM is frankly completely irrelevant.
codeakki17 hours ago
What's the point of this? Im not here to engage with AI bots
whateveracct17 hours ago
accidentally quadratic
eddysir7 hours ago
[flagged]
transkey15 hours ago
[dead]
mgc_blackbox13 hours ago
[dead]
rbrown4615 hours ago
[dead]
runfuyngunasdlj8 hours ago
[flagged]
spacebacon15 hours ago
But how do they “think”? This is the only repo that can tell you that.<a href="https://github.com/space-bacon/SRT" rel="nofollow">https://github.com/space-bacon/SRT</a>