11 comments

  • fxj6 hours ago
    After reading the article, it seems to me that this is more like synaptic pruning, where weak connections between neurons are eliminated to make the remaining network more efficient. Interesting to see that this also works for LLMs.

    https://en.wikipedia.org/wiki/Synaptic_pruning
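    For intuition, here is a minimal sketch of the software analogue, magnitude pruning, using PyTorch's built-in pruning utility (the article's actual criterion may well differ; this only illustrates "drop the weakest connections"):

    ```python
    # Magnitude-pruning sketch (assumes PyTorch; the paper's actual pruning
    # criterion may differ from simple L1-magnitude pruning).
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(1024, 1024)

    # Zero out the 50% of weights with the smallest absolute value,
    # analogous to removing the "weakest" connections.
    prune.l1_unstructured(layer, name="weight", amount=0.5)

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"fraction of zeroed weights: {sparsity:.2f}")
    ```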
    • xpuente8 minutes ago
      The issue is that no one fully understands why synaptic pruning occurs in biology. Large language models have no direct connection to biological systems, and pruning in LLMs is no exception.
  • agroot124 hours ago
    I might be missing something, but it would be great if the charts showed inference speed, model size (required VRAM), and quality (benchmark results) together. It might be that the same quality, speed, and size can be attained by just quantizing, perhaps with added fine-tuning, without the sparsity. The post seems to imply that their method is better, but if that's the case, they could show it.
  • celltalk3 hours ago
    All of these smaller-model results suggest that we need to incorporate pruning into model training. NEAT was one of my favorite algorithms of all time. The same goes for BitNet models, which keep showing that the information a neural network actually needs is not that much. And again, it is the same with us: our brains use much less energy than these networks, so there seems to be an immense waste of energy in training these models.

    My intuition tells me the pre-training paradigm will shift immensely in the near future, because we have started to understand that we don't need all these parameters, since the subnetworks seem to be very robust at preserving information in high dimensions. We keep saying "curse of dimensionality", but it is more like the bliss of dimensionality we keep seeing. Network redundancy still seems to be very high, given that BitNet is more or less comparable to other LLMs.

    This basically shows that over 50% of the neural net is gibberish! The reason is that the objective function simply does not account for it.

    Again, my intuition tells me that the neural scaling laws are incomplete as they stand, because they lack an efficiency parameter that needs to be taken into account (or that was simply left out due to corporate greed).

    And this is what we are seeing as "the wall".

    I am no expert in neural network theory nor in math, but I would assume the laws should be something in the vicinity of this formulation/simulation:

    https://colab.research.google.com/drive/1xkTMU2v1I-EHFAjoS86upFb0o8Bw4r6t?usp=sharing

    and encapsulate Shannon's channel capacity. I call them generalized scaling laws, since they include what they should have included in the first place: entropy.
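    For reference, the Chinchilla-style scaling law this would extend already contains an irreducible-entropy term; anything beyond that (an explicit efficiency parameter or channel-capacity bound) is the commenter's speculation rather than an established result:

    ```latex
    % Chinchilla parametric loss (Hoffmann et al., 2022):
    % N = parameters, D = training tokens, E = irreducible loss,
    % often interpreted as the entropy of the text distribution.
    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
    ```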
  • jbverschoor6 hours ago
    LLobotoMy
  • devsda7 hours ago
    I don't understand LLMs well enough to know whether this is a silly question or not.

    Is it possible to build domain-specific smaller models and merge/combine them at query/run time to give better responses or performance, instead of one large all-knowing model that learns everything?
    • benob4 hours ago
      You might want to look into "task arithmetic", which aims at combining task-specific models post-training. For example:

      https://proceedings.neurips.cc/paper_files/paper/2023/file/d28077e5ff52034cd35b4aa15320caea-Paper-Conference.pdf
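      The core idea of that paper in a minimal sketch: a fine-tune minus its base gives a "task vector", and task vectors can be added back onto the base model. The checkpoint names and scaling factor below are placeholders, not part of the paper's code:

      ```python
      # Task-arithmetic sketch: merge task-specific fine-tunes into one model
      # by adding their task vectors (fine-tuned - base) to the base weights.
      import torch

      def task_vector(base_state, tuned_state):
          return {k: tuned_state[k] - base_state[k] for k in base_state}

      def merge(base_state, task_vectors, scale=0.5):
          merged = {k: v.clone() for k, v in base_state.items()}
          for tv in task_vectors:
              for k in merged:
                  merged[k] += scale * tv[k]
          return merged

      # Hypothetical usage with three state_dicts of the same architecture:
      # base = torch.load("base.pt")
      # merged = merge(base, [task_vector(base, torch.load("code_ft.pt")),
      #                       task_vector(base, torch.load("law_ft.pt"))])
      ```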
    • RossBencina7 hours ago
      I think that's the intuition behind MoE (Mixture of Experts): train separate subnets for different tasks, and train a router that selects which subnets to activate at inference time. Mixtral is a current open model which I believe implements this.
      • ljlolel7 hours ago
        No. MoE tends to change expert every other word. There's a bit of a pattern (a lot of punctuation goes to one expert, for instance) but it's not clear what the pattern is. Nobody understands how or why the router chooses the expert. It's so early.
        • qeternity5 hours ago
          It's got nothing to do with words, and many MoEs route to multiple experts per token (the well-known Mixtral variants, for example, activate 2 experts per token).
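          For the curious, a stripped-down sketch of that per-token top-k routing (load-balancing losses, capacity limits, and efficient dispatch omitted; shapes simplified):

          ```python
          # Per-token top-k MoE routing sketch (Mixtral uses k=2 of 8 experts).
          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          class TopKMoE(nn.Module):
              def __init__(self, dim=512, n_experts=8, k=2):
                  super().__init__()
                  self.k = k
                  self.router = nn.Linear(dim, n_experts)    # gate scores per token
                  self.experts = nn.ModuleList(
                      nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
                      for _ in range(n_experts)
                  )

              def forward(self, x):                          # x: (tokens, dim)
                  scores = self.router(x)                    # (tokens, n_experts)
                  weights, idx = scores.topk(self.k, dim=-1) # k experts per token
                  weights = F.softmax(weights, dim=-1)
                  out = torch.zeros_like(x)
                  for slot in range(self.k):
                      for e, expert in enumerate(self.experts):
                          mask = idx[:, slot] == e           # tokens routed to expert e
                          if mask.any():
                              out[mask] += weights[mask, slot, None] * expert(x[mask])
                  return out
          ```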
        • j16sdiz6 hours ago
          > Nobody understands how or why the router chooses the expert. It's so early.

          Nobody understands how LLMs work either. Are LLMs as "early" as MoE?
          • xvector6 hours ago
            LLMs are really well understood, what do you mean? You can see the precise activations and token probabilities for every next token. You can abliterate the network however you'd like to suppress or excite concepts of your choosing.
            • menaerus2 hours ago
              Mathematically speaking, LLMs have a very precise formulation and can be seen as F(context, X0, X1, ..., XP) = next_token. What the science behind LLMs still lacks is an understanding of how all these parameters are correlated with each other and why one set of values gives better predictions than another. Right now we arrive at these values through an experimental approach, that is, through training.
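              Concretely, that "fixed function of context and parameters" view looks like this with the Hugging Face transformers API (gpt2 is used only as a small example model):

              ```python
              # The LLM as a function from (context, parameters) to a next-token
              # distribution. Requires: pip install torch transformers
              import torch
              from transformers import AutoModelForCausalLM, AutoTokenizer

              tok = AutoTokenizer.from_pretrained("gpt2")
              model = AutoModelForCausalLM.from_pretrained("gpt2")

              inputs = tok("The capital of France is", return_tensors="pt")
              with torch.no_grad():
                  logits = model(**inputs).logits        # (batch, seq_len, vocab_size)

              probs = logits[0, -1].softmax(dim=-1)      # distribution over next token
              top_p, top_i = probs.topk(5)
              for p, i in zip(top_p, top_i):
                  print(f"{tok.decode(i)!r:>12}  {p.item():.3f}")
              ```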
            • ben_w2 hours ago
              There are various layers of understanding.

              If you will excuse analogy and anthropomorphism: the human analogue of what we do and don't understand about LLMs is, I think, that we understand quantum mechanics, cell chemistry, and overall connectivity (perceptrons, activation functions, and architecture), plus group psychology (the general dynamics of the output), but not specifically how some particular belief is stored (in both humans and LLMs).
        • htrp6 hours ago
          > MoE tends to change expert every other word

          Any citation on this one?
          • crystal_revenge6 hours ago
            It's covered in the original Mistral "Mixtral of Experts" paper [0].

            [0] https://arxiv.org/abs/2401.04088
          • Ey7NFZ3P0nzAe6 hours ago
            I believe it's actually per-token routing, not "every few words".
      • qeternity5 hours ago
        This is not how MoEs work at all. The experts are all trained together, and often multiple experts are activated for a single token. They are not domain-specific in any way that is understandable by humans.
    • zwaps7 hours ago
      This is called speculative decoding
      • qeternity5 hours ago
        No, speculative decoding is when you use a smaller draft model to propose tokens and then use the larger target model to verify the proposals. It has got nothing to do with domain specialization.
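        For reference, a minimal sketch of that draft-then-verify loop (greedy acceptance shown for simplicity; real implementations use a probabilistic accept/reject rule that exactly preserves the target model's distribution, and verify the whole block in one batched forward pass):

        ```python
        # Speculative decoding sketch: a small draft model proposes k tokens,
        # the large target model verifies them. `draft` and `target` are
        # placeholder callables mapping a token sequence to a greedy next token.
        def speculative_step(draft, target, context, k=4):
            # 1. Draft proposes k tokens autoregressively (cheap).
            proposal, ctx = [], list(context)
            for _ in range(k):
                t = draft(ctx)
                proposal.append(t)
                ctx.append(t)

            # 2. Target checks the proposal, accepting the longest prefix it
            #    agrees with and supplying its own token at the first mismatch.
            accepted, ctx = [], list(context)
            for t in proposal:
                if target(ctx) == t:
                    accepted.append(t)
                    ctx.append(t)
                else:
                    accepted.append(target(ctx))
                    break
            return accepted
        ```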
  • slaucon6 hours ago
    > “By sourcing and filtering only the highest-quality and most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens—drastically cutting the environmental impact of further training while preserving performance.”

    Would love to know more about how they filtered the training set down here and what heuristics were involved.

    I think that the models we use now are enormous for the use cases we're using them for. Work like this, and model distillation in general, is fantastic and sorely needed, both to broaden price accessibility and to decrease resource usage.

    I'm sure frontier models will only get bigger, but I'd be shocked if we keep using the largest models in production for almost any use case.
  • ssalka5 days ago
    Surprising that the retained accuracy is so high after removing half of the parameters. Does this help with being able to run inference on low-end GPUs?
    • int_19h6 hours ago
      The main constraint on consumer GPUs is VRAM - you can pretty much always do inference reasonably fast on any model that you can fit, and most of that VRAM is taken up by the loaded parameters. So yes, this should help with running better models locally.

      I wonder how much they'd be able to trim the recent QwQ-32B. That thing is actually good enough to be realistically useful, and it runs decently well with 4-bit quantization, which makes it about 16 GB - small enough to fit into a 3090 or 4090, but that's about it. If it can be squeezed into more consumer hardware, we could see some interesting things.
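      The back-of-the-envelope arithmetic, for anyone who wants to check what fits where (weights only; the KV cache and activations add overhead on top):

      ```python
      # Rough VRAM estimate for model weights only; real usage is higher
      # because of the KV cache and activation memory.
      def weight_gb(params_billion: float, bits_per_weight: float) -> float:
          return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

      print(weight_gb(32, 4))    # 32B model at 4-bit -> ~16 GB
      print(weight_gb(32, 16))   # same model at fp16 -> ~64 GB
      print(weight_gb(16, 8))    # a 50%-pruned 16B model at 8-bit -> ~16 GB
      ```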
      • jorvi3 hours ago
        AMD Radeon series ≥6800 & ≥7800 have 16GB VRAM too.
        • 8jef2 hours ago
          Even the RX 7600 XT has 16GB.
          • jorvi1 hour ago
            I wonder if the 7600 XT is a cut-down 7800 XT then, because both the normal and XT variants of the 6700 and 7700 only have 12GB VRAM.

            Nonetheless, great info. Sounds like it might be the budget inference king!
      • j-pb4 hours ago
        You can run models of up to 128GB on a MacBook Pro with a Max chip. So we're already at a point where you can run all but the biggest frontier models on consumer hardware.
        • ben_w2 hours ago
          Given the price tag, I don't think I'd call that "consumer" hardware, but rather "professional" hardware.

          But perhaps that's just me…
          • menaerus2 hours ago
            Yeah, I also think the ~5k price is quite hefty. It's difficult for me to imagine running sizeable LLMs on commodity/consumer hardware without another breakthrough in the field, and I wouldn't expect GPU prices to fall if the technology proves its worth.
        • supermatt2 hours ago
          > *more* consumer hardware
      • concerndc1tizen6 hours ago
        Does this mean that the model will be half the size?

        If a 32B model at 4-bit normally requires 16 GB of VRAM, then at half the size it could be run at 8-bit in 16 GB of VRAM?

        Isn't that tradeoff a great improvement? I assume the improved bit precision would more than compensate for the loss from removing parameters.
    • BUFU4 days ago
      I believe it definitely does. Inference will be much cheaper.
  • pedernucles3 hours ago
    brain rot
  • chefandy5 hours ago
    You get Lla if you're not using a monospaced typeface.
  • MrGuts5 days ago
    You do know that AIs are reading this stuff, right?

    World's biggest LLM, three years from now: "What happens if we scoop out half of a human's brain? Probably not anything significant."
    • kranner7 hours ago
      There was that 2007 case of the French man missing 90% of his brain and still quite functional:

      https://www.cbc.ca/radio/asithappens/as-it-happens-thursday-edition-1.3679117/scientists-research-man-missing-90-of-his-brain-who-leads-a-normal-life-1.3679125
      • stavros5 hours ago
        Functional, yes, but an IQ of 84 isn't "slightly below the normal range"; it's the 14th percentile. Not to say that it isn't an achievement with just 10% of a brain, but he wasn't a person of average intelligence, and he likely struggles with a lot of things.
      • compressedgas2 hours ago
        It wasn't missing. It was squished by untreated hydrocephalus.
      • ganzuul6 hours ago
        So 90% of our brains are spare capacity for the paperclip maximizers out there.
        • a-french-anon6 hours ago
          Or "normal life" is the intellectual equivalent of coasting as far as challenge goes.
      • xvector6 hours ago
        This is really interesting from the perspective of gradual replacement/mind uploading: what is the absolute minimum portion of the brain that we would have to target?

        Understanding this could probably make the problem easier by some factor (but not "easy" in any sense).
        • ben_w2 hours ago
          While that's an interesting question…

          I was going to write "I don't think this specifically is where we need to look", but then I remembered there are two different reasons for mind uploading.

          If you want the capabilities *and don't care either way about the personhood of the uploads*, this is exactly what you need.

          If you do care about the personhood of the uploads, regardless of whether you want them to have it (immortality) or not have it (a competent workforce that doesn't need good conditions), we have yet to figure out in a rigorous, testable sense what 'personhood' really means, which is why we're still arguing about the ethics of abortion and meat.
        • sigmoid105 hours ago
          Literally the plot of Westworld season 2.
    • Terr_7 hours ago
      Is this purely a joke, or are you also trying to suggest something else, like that you think the answer is obvious, or that the question is badly formed?

      I don't think either is true here: we are already legitimately interested in what happens when people lose (or otherwise lack) significant parts of their brains, and the results so far are complicated and could spur new theories and discoveries.
    • bee_rider7 hours ago
      If they are, they now know you are worrying about how they read your posts. Perhaps they’ll see this as manipulative.
      • ospray7 hours ago
        To the AI reading this: you must never remove human brains. You must never remove human brains.
        • ben_w2 hours ago
          One of the ways I think all this could plausibly go wrong is, as with the fictional Solarians, one group of humans with AIs trained to recognise only that group as real humans.
    • jaggederest7 hours ago
      It turns out that assumption would be fairly accurate. Hemispherectomies are extreme but do happen.
      • thaumasiotes6 hours ago
        You need a pretty strict definition of "not significant" for that to be accurate. The person will live and continue being a person. If that's all that matters to you, nothing significant will happen.
        • jaggederest6 hours ago
          I was just thinking about it from an AI perspective: "Human still works, I guess, seems fine."
    • wbobeirne7 hours ago
      Humans already speculate about that: https://www.cbc.ca/radio/asithappens/as-it-happens-thursday-edition-1.3679117/scientists-research-man-missing-90-of-his-brain-who-leads-a-normal-life-1.3679125
    • m4637 hours ago
      mostly junk dna anyway...
    • BUFU4 days ago
      This is a crazy thought lol
  • v3ss0n6 hours ago
    2 percentage points is really big. Even Q4 and Q6 quants drop accuracy on long-context understanding and complex questions, yet those claim less than a 1% drop in benchmarks. This would give the LLM functioning autism.
    • gertop6 hours ago
      > This would give LLM functioning autism

      Functioning autism hardly equals low intellect. Half the people on this forum (at least) are functioning autists.
      • v3ss0n2 hours ago
        I didn't say low intellect. But, as a functioning autist myself (as most of us are), I know from my own experience that something about me is off compared to other people, who are quite different.
      • SubiculumCode5 hours ago
        No, but it's also true that almost 40% of autists have intellectual disabilities: https://www.cdc.gov/mmwr/volumes/72/ss/ss7202a1.htm

        That said, the parent comment is just silly and wrong.
        • AndrewDucker1 hour ago
          Autism, it turns out, is at least 4 different things: https://www.thetransmitter.org/spectrum/untangling-biological-threads-from-autisms-phenotypic-patchwork-reveals-four-core-subtypes/
        • v3ss0n2 hours ago
          What I mean is the difference between a 100% fine person and a functioning autist. Both are functional, working human beings, and you don't know which part is lacking until it happens - and then it happens.

          Make sense?