After reading the article, it seems to me that this is more like synaptic pruning, where weak connections between neurons are eliminated in order to increase the efficiency of the network. Interesting to see that this also works for LLMs.

https://en.wikipedia.org/wiki/Synaptic_pruning
I might be missing something, but it would be great if the charts showed inference speed, model size (required VRAM), and quality (benchmark results) in one place. It might be that the same quality, speed, and size can be attained by just quantizing, perhaps with added fine-tuning, without the sparsity. The post seems to imply that their method is better, but if that's the case, they could show it.
This whole small-model paradigm suggests that we need to incorporate pruning into model training. NEAT was one of my favorite algorithms of all time. Same with BitNet models, which keep showing that the amount of information neural networks actually need is not that much. And it's the same with us: the brain uses far less energy than an artificial network, so there seems to be an immense waste of energy in training these models.

My intuition tells me the pre-training paradigm will shift immensely in the near future, because we are starting to understand that we don't need all these parameters: the subnetworks seem to be very robust at preserving information in high dimensions. We keep talking about the curse of dimensionality, but what we keep seeing looks more like the bliss of dimensionality. Network redundancy still seems to be very high, given that BitNet is more or less comparable to other LLMs.

This basically shows that over 50% of the neural net is gibberish! The reason is that the objective function simply does not account for it.

Again, my intuition tells me that the neural scaling laws are incomplete as they stand, because they lack an efficiency parameter that needs to be taken into account (or it was simply left out due to corporate greed).

And this is what we are seeing as "the wall".

I am no expert in neural network theory or in math, but I would assume the laws should be something in the vicinity of this formulation/simulation:

https://colab.research.google.com/drive/1xkTMU2v1I-EHFAjoS86upFb0o8Bw4r6t?usp=sharing

and should encapsulate Shannon's channel capacity. I call them generalized scaling laws, since they include what should have been included in the first place: entropy.
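For what it's worth, here is a minimal toy sketch of the kind of adjustment I mean (my own illustration, not the formulation in the notebook above; the Chinchilla coefficients are the approximate published fit, and the "efficiency" term is purely an assumption):

    # Illustrative only: a Chinchilla-style loss curve with an extra
    # "efficiency" knob standing in for how much of the parameter count
    # actually carries information. Coefficients are the approximate
    # Hoffmann et al. (2022) fit; the efficiency factor is an assumption.

    def scaling_loss(N, D, efficiency=1.0,
                     E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        """Predicted loss for N parameters and D training tokens,
        where only efficiency * N parameters are assumed to be useful."""
        N_eff = efficiency * N
        return E + A / N_eff**alpha + B / D**beta

    # A 30B-parameter model trained on 1T tokens, with and without the
    # assumption that half the weights are redundant:
    print(scaling_loss(30e9, 1e12, efficiency=1.0))
    print(scaling_loss(30e9, 1e12, efficiency=0.5))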
LLobotoMy
I don't understand LLMs enough to know if this is a silly question or not.

Is it possible to build domain-specific smaller models and merge/combine them at query/run time to give better responses or performance, instead of one large all-knowing model that learns everything?
You might want to look into "task arithmetic", which aims at combining task-specific models post-training. For example:

https://proceedings.neurips.cc/paper_files/paper/2023/file/d28077e5ff52034cd35b4aa15320caea-Paper-Conference.pdf
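The core recipe from that line of work is simple to sketch. A toy version operating on PyTorch state dicts (the model names are illustrative; the linked paper additionally works in the tangent space, which this skips):

    import torch

    def task_vector(finetuned, base):
        """Task vector: element-wise difference between a fine-tuned
        checkpoint and the shared base model, per parameter tensor."""
        return {k: finetuned[k] - base[k] for k in base}

    def merge(base, task_vectors, scale=0.4):
        """Add several task vectors back onto the base model.
        'scale' is a tunable coefficient; ~0.3-0.5 is a common starting point."""
        merged = {k: v.clone() for k, v in base.items()}
        for tv in task_vectors:
            for k in merged:
                merged[k] += scale * tv[k]
        return merged

    # Usage (state dicts loaded elsewhere, e.g. via torch.load; names hypothetical):
    # tv_math = task_vector(math_model, base_model)
    # tv_code = task_vector(code_model, base_model)
    # combined = merge(base_model, [tv_math, tv_code])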
I think that's the intuition behind MoE (Mixture of Experts): train separate subnets for different tasks, and train a router that selects which subnets to activate at inference time. Mixtral is a current open model which I believe implements this.
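A minimal sketch of that routing idea, assuming a toy top-k gate in PyTorch (not Mixtral's actual implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMoE(nn.Module):
        """Toy mixture-of-experts layer: a linear gate scores the experts
        per token, the top-k experts are run, and their outputs are
        combined weighted by the gate's softmax scores."""
        def __init__(self, dim=512, n_experts=8, k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                 for _ in range(n_experts)])
            self.gate = nn.Linear(dim, n_experts)
            self.k = k

        def forward(self, x):                       # x: (tokens, dim)
            scores = self.gate(x)                   # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)    # normalize over the chosen k
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask][:, slot:slot + 1] * expert(x[mask])
            return out

    # y = ToyMoE()(torch.randn(16, 512))  # 16 tokens, 2 of 8 experts per token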
No. MoE tends to change expert every other word. There's a bit of a pattern (a lot of punctuation going to one expert, for example), but it's not clear what it means. Nobody understands how or why the router chooses the expert. It's so early.
It's got nothing to do with words, and many MoEs route to multiple experts per token (the well-known Mixtral variants, for example, activate 2 experts per token).
> Nobody understands how or why the router chooses the expert. It's so early.

Nobody understands how LLMs work either.
Are LLMs as "early" as MoE?
> MoE tends to change expert every other word

Any citation on this one?
It's covered in the original Mistral "Mixtral of Experts" paper [0].

0. https://arxiv.org/abs/2401.04088
I believe it's actually per-token routing, not "every few words".
This is not how MoEs work at all. They are all trained together, and often multiple experts are activated for a single token. The experts are not domain-specific in any way that is understandable by humans.
This is called speculative decoding.
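For reference, the rough shape of speculative decoding, as a simplified greedy-verification sketch (the draft_model/target_model interfaces here are hypothetical stand-ins; the published algorithm uses a probabilistic accept/reject rule):

    def speculative_decode(draft_model, target_model, prompt, n_draft=4, max_new=64):
        """Simplified greedy speculative decoding. Assumed interfaces:
        draft_model.propose(ctx, n) returns n cheap candidate tokens;
        target_model.greedy(ctx, n) returns the large model's greedy choices
        for the same n positions in a single forward pass."""
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_new:
            draft = draft_model.propose(tokens, n_draft)    # cheap small-model guesses
            check = target_model.greedy(tokens, n_draft)    # one large-model pass
            for d, v in zip(draft, check):
                tokens.append(v)                            # the large model's token always wins
                if d != v:                                  # stop at the first disagreement;
                    break                                   # everything before it came "for free"
        return tokens[:len(prompt) + max_new]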
> “By sourcing and filtering only the highest-quality and most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens—drastically cutting the environmental impact of further training while preserving performance.”

Would love to know more about how they filtered the training set down here and what heuristics were involved.

I think the models we use now are enormous for the use cases we're using them for. Work like this, and model distillation in general, is fantastic and sorely needed, both to broaden price accessibility and to decrease resource usage.

I'm sure frontier models will only get bigger, but I'd be shocked if we keep using the largest models in production for almost any use case.
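The post doesn't spell it out, but a common shape for this kind of pipeline (a guess at what might be involved, not their documented method) is classifier-based quality scoring plus deduplication; the classifier object below is a hypothetical stand-in:

    import hashlib

    def near_duplicate_key(doc, n=8):
        """Cheap dedup key: hash of the first n whitespace tokens.
        Real pipelines typically use MinHash/LSH over shingles instead."""
        return hashlib.sha1(" ".join(doc.split()[:n]).encode()).hexdigest()

    def filter_corpus(docs, classifier, threshold=0.8):
        """Keep only high-scoring, non-duplicate documents.
        'classifier' stands in for a small quality model (e.g., one trained
        to separate curated text from random web text)."""
        seen, kept = set(), []
        for doc in docs:
            key = near_duplicate_key(doc)
            if key in seen:
                continue                      # drop near-duplicates
            seen.add(key)
            if classifier.score(doc) >= threshold:
                kept.append(doc)              # keep only high-quality docs
        return kept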
Surprising that the retained accuracy is so high after removing half of the parameters. Does this help with running inference on low-end GPUs?
The main constraint on consumer GPUs is the VRAM - you can pretty much always do inference reasonably fast on any model that you can fit. And most of that VRAM is the loaded parameters, so yes, this should help with running better models locally.

I wonder how much they'd be able to trim the recent QwQ-32B. That thing is actually good enough to be realistically useful, and runs decently well with 4-bit quantization, which makes it 16 GB large - small enough to fit into a 3090 or 4090, but that's about it. If it can be squeezed into more consumer hardware, we could see some interesting things.
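Rough arithmetic behind those numbers (back-of-envelope, weights only; KV cache and runtime overhead not counted):

    def weight_vram_gb(params_billions, bits_per_weight):
        """Approximate VRAM needed just for the weights."""
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_vram_gb(32, 4))    # 32B model at 4-bit          -> ~16 GB
    print(weight_vram_gb(16, 4))    # half-pruned 16B at 4-bit    -> ~8 GB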
AMD Radeon cards from the ≥6800 and ≥7800 series have 16 GB of VRAM too.
Even the RX 7600 XT has 16 GB.
I wonder if a 7600 XT is a cut-down 7800 XT then, because both normal and XT variants of the 6700 and 7700 only have 12 GB VRAM.

Nonetheless, great info. Sounds like it might be the budget inference king!
You can run models of up to 128 GB on a MacBook Pro with a Max chip. So we're already at a point where you can run all but the biggest frontier models on consumer hardware.
Given the price tag, I don't think I'd call that "consumer" hardware, but rather "professional" hardware.

But perhaps that's just me…
Yeah, I also think that the ~5k price is quite hefty. It's difficult for me to imagine that running sizeable LLMs on commodity/consumer hardware will be possible without another breakthrough in the field. I wouldn't expect GPU prices to fall if the technology proves its worth.
> *more* consumer hardware
Does this mean that the model will be half the size?

If a 32B model at 4-bit normally requires 16 GB of VRAM, then at half the size it could be run at 8-bit in the same 16 GB?

Isn't that tradeoff a great improvement? I assume the improved bit precision would more than compensate for the loss related to the removal?
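The arithmetic does work out for the weights themselves (rough estimate; KV cache and overhead excluded, and pruning and quantization losses don't simply cancel out):

    # Bytes for weights = params * bits / 8
    full_4bit   = 32e9 * 4 / 8 / 1e9   # 32B model at 4-bit        -> 16 GB
    pruned_8bit = 16e9 * 8 / 8 / 1e9   # 16B pruned model at 8-bit -> 16 GB
    print(full_4bit, pruned_8bit)      # same footprint, different precision/parameter tradeoff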
I believe it definitely does. Inference will be much cheaper.
brain rot
You get Lla if you’re not using a monospaced typeface.
You do know that AIs are reading this stuff, right?

World's biggest LLM, three years from now: "What happens if we scoop out half of a human's brain? Probably not anything significant."
There was that 2007 case of the French man missing 90% of his brain and still quite functional:

https://www.cbc.ca/radio/asithappens/as-it-happens-thursday-edition-1.3679117/scientists-research-man-missing-90-of-his-brain-who-leads-a-normal-life-1.3679125
Functional, yes, but an IQ of 84 isn't "slightly below the normal range"; it's the 14th percentile. Not to say that it's not an achievement with just 10% of a brain, but he wasn't a person of average intelligence, and he likely struggles with a lot of things.
It wasn't missing. It was squished by untreated hydrocephalus.
So 90 percent of our brains are spare capacity for the paper clip maximizers out there.
This is really interesting from the perspective of gradual replacement/mind uploading: what is the absolute minimum portion of the brain that we would have to target?

Understanding this could probably make the problem easier by some factor (but not "easy" in any sense).
While that's an interesting question…

I was going to write "I don't think this specifically is where we need to look", but then I remembered there are two different reasons for mind uploading.

If you want the capabilities *and don't care either way about the personhood of the uploads*, this is exactly what you need.

If you do care about the personhood of the uploads, regardless of whether you want them to have it (immortality) or not have it (a competent workforce that doesn't need good conditions), we have yet to even figure out in a rigorous, testable sense what 'personhood' really means — which is why we're still arguing about the ethics of abortion and meat.
Literally the plot of Westworld season 2.
Is this purely a joke, or are you also trying to suggest something else, like that you think the answer is obvious, or that the question is badly formed?

I don't think either is true here: we are already legitimately interested in what happens when people lose (or otherwise lack) significant parts of their brains, and the results so far are complicated and could spur new theories and discoveries.
If they are, they now know you are worrying about how they read your posts. Perhaps they’ll see this as manipulative.
It turns out that assumption would be fairly accurate. Hemispherectomies are extreme but do happen.
Humans already speculate about that: https://www.cbc.ca/radio/asithappens/as-it-happens-thursday-edition-1.3679117/scientists-research-man-missing-90-of-his-brain-who-leads-a-normal-life-1.3679125
Mostly junk DNA anyway...
This is a crazy thought lol
Two percentage points is really big. Even Q4/Q6 quants drop accuracy on long-context understanding and complex questions, yet those claim less than a 1% drop on benchmarks. This would give the LLM functioning autism.