Zebra-Llama – Towards efficient hybrid models

(arxiv.org)

113 points by mirrir 62 days ago

8 comments

adityashankar 62 days ago
Due to perverse incentives and the history of models over-claiming accuracy, it's very hard to believe anything until it is open source and can be tested out.
That being said, I do very much believe that the computational efficiency of models is going to go up [correction] drastically over the coming months, which does pose interesting questions about Nvidia's throne.
*I previously miswrote and said computational efficiency will go down.
- credit_guy 62 days ago
 Like this? https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT
 - jychang 62 days ago
 Or like this: https://api-docs.deepseek.com/news/news251201
 I don't know what's so special about this paper.
 - They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and also V3 and Deepseek R1 etc).
 - They claim to use a hybrid linear attention architecture. So does Deepseek V3.2, and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.
 - They claim to save a lot of money by not doing a full pre-train run for millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6mil full pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post-train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run. I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
 Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.
 I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new that's not in Deepseek V3.2?
 - twotwotwo 62 days ago
 These are potentially complementary approaches. Various innovations have shrunk the KV cache size or (with DSA) how much work you have to do in each attention step. This paper is about hybrid models where some layers' state needs don't grow with context size at all.
 SSMs have a fixed-size state space, so on their own they'll never be able to recite a whole file of your code in a code-editing session, for example. But if much of what an LLM is doing isn't long-distance recall, you might be able to get away with only giving some layers full recall capability, with other layers manipulating the info already retrieved (plus whatever's in their own more limited memory).
 I think Kimi Linear Attention and Qwen3-Next are both doing things a little like this: most layers' attention/memory doesn't grow with context size. Another approach, used in Google's small open Gemma models, is to give some layers only 'local' attention (most recent N tokens) and give a few 'full' (whole context window) attention. I guess we're seeing how those approaches play out and how different tricks can be cobbled together.
 There can potentially be a moneyball aspect to good model architecture. Even if using space-saving attention mechanisms in some layers of big models costs something in performance on its own, their efficiency could allow you to 'spend' more elsewhere (more layers or more params or such) to end up with better overall performance at a given level of resources. Seems like it's good to have experiments with many different approaches going on.
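 A rough way to see the trade-off described above is to compare how memory grows with context for full-attention layers versus fixed-state layers. This is only a sketch: the head counts, state size, and the 8/24 layer split (loosely echoing the name of the checkpoint linked upthread) are illustrative assumptions, not figures from the paper, and it ignores any further compression from MLA.

```python
# Rough memory model for a hybrid layer stack.
# All sizes below are illustrative assumptions, not numbers from the paper.

def attention_cache_bytes(context_len, n_layers, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache for full-attention layers: grows linearly with context length."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # keys + values
    return n_layers * context_len * per_token_per_layer

def ssm_state_bytes(n_layers, state_values_per_layer=1_000_000, bytes_per_value=2):
    """Recurrent state for SSM layers: constant, independent of context length."""
    return n_layers * state_values_per_layer * bytes_per_value

def hybrid_bytes(context_len, attn_layers, ssm_layers):
    """Total inference state for a stack mixing attention and SSM layers."""
    return attention_cache_bytes(context_len, attn_layers) + ssm_state_bytes(ssm_layers)

for ctx in (1_024, 32_768, 131_072):
    full = hybrid_bytes(ctx, attn_layers=32, ssm_layers=0)    # all-attention baseline
    hybrid = hybrid_bytes(ctx, attn_layers=8, ssm_layers=24)  # hypothetical hybrid split
    print(f"{ctx:>7} tokens: full={full / 2**20:10.1f} MiB  hybrid={hybrid / 2**20:10.1f} MiB")
```

 The point of the sketch is only the shape of the curves: the attention term scales with context length while the SSM term is a fixed overhead, so the hybrid stack's advantage widens as the context grows.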
 - T-A 62 days ago
 From your link: DeepSeek-V3.2 Release 2025/12/01
 From Zebra-Llama's arXiv page: Submitted on 22 May 2025
 - Palmik 62 days ago
 DeepSeek's MLA paper was published in 2024: https://arxiv.org/abs/2405.04434
 DeepSeek's Sparse Attention paper was published in February: https://arxiv.org/abs/2502.11089
 DeepSeek 3.2 Exp (combining MLA and DSA) was released in September.
 You also had several other Chinese hybrid models, like Qwen3 Next and Minimax M1.
 - jychang 62 days ago
 That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.
 > Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025
 I mean, good for them for shipping, I guess. But seriously, I expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure never-heard-of concept.
 [1] https://huggingface.co/ibm-granite/granite-4.0-tiny-base-preview
 [2] https://youtu.be/Q5baLehv5So?t=6075
 - nickpsecurity 62 days ago
 "Deepseek hasn't done a full $5.6mil full "
 Don't forget the billion dollars or so of GPUs they had access to that they left out of that accounting. Also, the R&D cost of the Meta model they originally used. Then, they added $5.6 million on top of that.
 - credit_guy 62 days ago
 Here's what's important about this paper. It is written by AMD researchers. It shows AMD is investing in AI research. Is this the same level of achievement as DeepSeek 3.2? Most likely not. Do they have novel ideas? Difficult to say, there are hundreds of new ideas being tried in this space. Is this worthless? Most certainly not. In order to make progress in this domain (as in any other), you first need to get your feet wet. You need to play with the various components, and see how they fit together. The idea in this paper is that you can somehow combine SSMs (like Mamba) and LLMs (like Llama). The examples they give are absolute toys compared to DeepSeek 3.2 (the largest is 8 billion parameters, while DeepSeek 3.2 has 671 billion parameters). The comparison you are trying to make simply does not apply. The good news for all of us is that AMD is working in this space.
 - jychang 62 days ago
 Mamba-based LLMs aren't even close to novel though. IBM's been doing this since forever [1].
 Also, you're off on Deepseek V3.2's param count; the full model is 685B in size with the MTP layer.
 I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's not cutting edge when Deepseek or even IBM is running laps around them.
 [1] Here's a news article from April, although IBM has been doing it for a long time before that: https://research.ibm.com/blog/bamba-ssm-transformer-model
 credit_guy 61 days ago
 It's not cutting edge, so what? Your point is that nobody should publish anything unless it is cutting edge?
 jychang 61 days ago
 Yeah, that's the point of publishing. You get scooped, you lose.
 credit_guy 61 days ago
 This wasn’t published, it was just posted to the arxiv.
 - SilverElfin 62 days ago
 How did you get all this info about how each is trained? Is that something they admit now or is it through leaks?
 - jychang 62 days ago
 Deepseek? It's literally in their research papers.
 OpenAI? The OpenAI head of research @markchen90 straight up admitted it in a podcast.
 https://x.com/petergostev/status/1995744289079656834
 "In the last 2 years we've put so much resourcing into, into reasoning and one byproduct of that is you lose a little bit of muscle on pre training and post training." "In the last six months, @merettm and I have done a lot of work to build that muscle back up." "With all the focus on RL, there's an alpha for us because we think there's so much room left in pre training." "As a result of these efforts, we've been training much stronger models. And that also gives us a lot of confidence carrying into Gemini 3 and other releases coming this end of the year."
 Note, "alpha" in the quote above is referring to https://en.wikipedia.org/wiki/Alpha_(finance)
 But it's pretty clear that the last full pretrain run they've released is for gpt-4o 2 years ago*, and since then they've just been iterating RL for their models. You don't need any insider information to notice that, it's pretty obvious.
 *Excluding GPT-4.5 of course, but even OpenAI probably wants us to forget about that.
 nl 62 days ago
 SemiAnalysis also believes they haven't done a full pretraining run since 4o (except for GPT-4.5): https://open.substack.com/pub/semianalysis/p/tpuv7-google-takes-a-swing-at-the?selection=578c2a78-53b1-41bb-9551-e0c481241824&utm_campaign=post-share-selection&utm_medium=web&aspectRatio=instagram&textColor=%23ffffff&bgImage=true
 - deepdarkforest 62 days ago
 > which does pose interesting questions over nvidia's throne...
 > Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that...
 Hmmm
 - adityashankar 62 days ago
 yes!, thanks for the link!
 - moffkalast 62 days ago
 GGUF when? /s
- ACCount37 62 days ago
 I don't doubt the increase in efficiency. I doubt the "drastically".
 We already see models become more and more capable per weight and per unit of compute. I don't expect a state-change breakthrough. I expect: more of the same. A SOTA 30B model from 2026 is going to be ~30% better than one from 2025.
 Now, expecting that to hurt Nvidia? Delusional.
 No one is going to stop and say "oh wow, we got more inference efficiency - now we're going to use less compute". A lot of people are going to say "now we can use larger and more powerful models for the same price" or "with cheaper inference for the same quality, we can afford to use more inference".
 - colechristensen 62 days ago
 Eh.
 Right now, Claude is good enough. If LLM development hit a magical wall and never got any better, Claude is good enough to be terrifically useful and there's diminishing returns on how much good we get out of it being at $benchmark.
 Saying we're satisfied with that... well, how many years until efficiency gains from one side and consumer hardware from the other meet in the middle, so "good enough for everybody" open models are available for anyone who wants to pay for a $4000 MacBook (and after another couple of years a $1000 MacBook, and several more and a fancy wristwatch)?
 Point being, unless we get to a point where we start developing "models" that deserve civil rights and citizenship, the years are numbered to where we NEED cloud infrastructure and datacenters full of racks and racks of $x0,000 hardware.
 I strongly believe the top end of the S curve is nigh, and with it we're going to see these trillion dollar ambitions crumble. Everybody is going to want a big-ass GPU and a ton of RAM but that's going to quickly become boring because open models are going to exist that eat everybody's lunch and the trillion dollar companies trying to beat them with a premium product aren't going to stack up outside of niche cases and much more ordinary cloud compute motivations.
 - ACCount37 62 days ago
 Good enough? There's no such thing.
 People said that "good enough" about GPT-4. Now you say that about Claude Opus 4.5. How long before the treadmill turns, and the very same Opus 4.5 becomes "the bare minimum" - the least capable AI you would actually consider using for simple and unimportant tasks?
 We have miles and miles of AI advancements ahead of us. The end of that road isn't "good enough". It's "too powerful to be survivable".
 - colechristensen 62 days ago
 I can build fully functional applications without writing a single line of code with Claude. In my free time. On a weekend. I'm going to release one of them pretty soon. A toddler being able to do this instead of an industry veteran isn't that compelling. Avoiding the few pitfalls of the LLM getting stuck and taking a while to get out isn't that valuable.
 > Good enough? There's no such thing.
 This is just wrong. Maybe you can't imagine good enough, I can. And I think "better" is going to start getting diminishing returns, as I expect the velocity of improvements to slow and the value of each improvement to become less meaningful. The "cost" of an LLM making mistakes is already pretty low. Cutting it in half is better, sure, but it's so low already that I don't particularly care if mistakes get some multiple more rare.
 grogers 62 days ago
 LLMs only fairly recently underwent a step change from "maybe someday" to actually useful now. That opened many new doors that people didn't even think were possible. Getting incrementally better at something they are already pretty good at isn't that impressive. But getting drastically better at something they are currently bad at will drive new models and new research.
 - cyanydeez 62 days ago
 Elon will boil the oceans if it means not having to deal with poor people.
 - buu700 62 days ago
 Coding capability in and of itself may be "good enough" or close to it, but there's a long way to go before AI can build and operate a product end-to-end. In fairness, a lot of the gap may be tooling.
 But the end state in my mind is telling an AI "build me XYZ", having it ask all the important questions over the course of a 30-minute chat while making reasonable decisions on all lower-level issues, then waking up the next morning to a live cloud-hosted test environment at a subdomain of the domain it said it would buy along with test builds of native apps for Android, iOS, Linux, macOS, and Windows, all with near-100% automated test coverage and passing tests. Coding agents feel like magic, but we're clearly not there yet.
 And that's just coding. If someone wanted to generate a high-quality custom feature-length movie within the usage limits of a $20/mo AI plan, they'd be sorely disappointed.
 - nl 62 days ago
 > But the end state in my mind is telling an AI "build me XYZ", having it ask all the important questions over the course of a 30-minute chat while making reasonable decisions on all lower-level issues, then waking up the next morning to a live cloud-hosted test environment at a subdomain of the domain it said it would buy along with test builds of native apps for Android, iOS, Linux, macOS, and Windows, all with near-100% automated test coverage and passing tests. Coding agents feel like magic, but we're clearly not there yet.
 If you take out the native builds we are there now. V0, Lovable, etc really do a great job of this. If you want an IDE-like environment Antigravity is pretty good too.
 The native builds thing is completely doable too. I've built cross platform apps in 30 minutes of my time using Codex+Flutter. It really does work.
 - colechristensen 62 days ago
 > But the end state in my mind is telling an AI "build me XYZ", having it ask all the important questions over the course of a 30-minute chat while making reasonable decisions on all lower-level issues, then waking up the next morning to a live cloud-hosted test environment at a subdomain of the domain it said it would buy along with test builds of native apps for Android, iOS, Linux, macOS, and Windows, all with near-100% automated test coverage and passing tests. Coding agents feel like magic, but we're clearly not there yet.
 I'm pretty sure we're there. I'm not sure how interested I am in completely closing that loop and completely removing the human from the loop. But I'm also pretty confident that I could do it with nothing but existing models and software built around them.
 buu700 62 days ago
 I'm not aware that we are there, but would be very interested if you have information to the contrary. Even if we were there, the product/service that does it would have to be at a reasonable cost in order to be useful for most people.
 As I said, a lot of the gap may be tooling. But I'm skeptical that even the models themselves are capable of that given sufficiently advanced tooling. I'm not saying we're not close (certainly much closer than we were at the start of the decade), but if we were actually there, you would have zero reservations about removing the human from the loop of an initial prototype.
 colechristensen 62 days ago
 My information to the contrary is my experience in the last few weeks building things with LLMs, including tooling to help build things with LLMs. The experience is one of ... I'm a product manager and devsecops engineer bullying an LLM with the psychology of a toddler into building great software, which it can do very successfully. A single instance of a model with a single rolling context window and one set of prompts absolutely can't do what you want, but that's not what I've been doing.
 Oneshotting applications isn't interesting to me because I do want to be involved. There are things I have opinions about that I won't know I will have until we get there, and there are definitely times where I want to pivot a little or a lot in the middle of development based on experience: an actually agile development cycle.
 In the same way I wouldn't want to hire a wedding planner or house builder to plan my wedding or build my home based entirely on a single short meeting before anything started, I don't want to one shot software.
 There are all sorts of things where I want to get myself out of the loop because they're stupid problems. Some of them I've fixed, others I'd rather fix later because doing the thing is more interesting than pausing and building the tools to make the thing.
 There is, I think, an inverse relationship between the complexity of the tooling and the amount of human involvement. For me, I've reached or am quite near the amount of human involvement where I'm much more excited about building stuff than saving more of my attention.
 I'm being a bit vague because I'm not sure I want to share all of my secrets just yet.
 buu700 62 days ago
 Just to be clear, what I was proposing was a single tool which would, on the basis of a single ~30-minute interaction, purchase a domain name, set up a cloud environment, build a full-stack application + cross-platform native apps + useful tests with near-100% coverage, deploy a live test environment, and compile each platform's native app, all entirely autonomously. Are you saying you've used or built something similar to that? That is super interesting if so, even if you're unable to share. A major subset of that could also still be incredibly useful, but the whole solution I described is a very high bar.
 I've been very successful building with custom LLM workflows and automation myself, but that's beyond the capabilities of any tooling I've seen, and I wouldn't necessarily expect great results with current models even if current tooling were fully capable of what I described. Even with such tooling, the cost of inference is high enough to deter careless usage without much more rigorous work on the initial spec and/or micromanagement of the development process.
 I'm not necessarily advocating for one-shotting in any given context. I'm simply pointing out that there would be huge advantages to LLMs and tooling sufficiently advanced to be fully capable of doing so end-to-end, especially at dramatically lower cost than current models and at superhuman quality. Such an AI could conceivably one-shot any possible project idea, in the same sense that a competent human dev team with nothing but a page of vague requirements and unlimited time could at least eventually produce something functional.
 The value of such an AI is that we'd use it in ways that sound ridiculous today. Maybe a chat with some guy at a bar randomly inspires a neat idea, so you quickly whip out your phone and fire off some bullet point notes; by the time you get home, you have 10 different near-production-ready variations to choose from, each with documentation on the various decisions its agent made and why, and each one only cost $5 in account credit. None is quite perfect, but through the process you've learned a lot and substantially refined the idea; you give it a second round of notes and wake up to a new testable batch. One of those has the functional requirements just right, so you make the final decisions on non-functional requirements and let it roll one last time with strict attention to detail on code quality and a bunch of cycles thrown at security review.
 That evening, you check back in and find a high-quality final implementation that meets all of your requirements with a performant and scalable architecture, with all infrastructure deployed and apps submitted to all stores/repositories. You subsequently allocate a sales and marketing budget to the AI, and eventually notice that you suddenly have a new source of income. Now imagine that instead of you, this was actually your friend who's never written a line of code and barely knows how to use a computer.
 I still agree with you that current models have been "good enough" for some time, in the sense that if LLMs froze today we could spend the next decade collectively building on and with them and it would totally transform the economy. But at the same time, there's definitely latent demand for more and/or better inference. If LLMs were to become radically more efficient, we wouldn't start shuttering data centers; the economy would just become that much more productive.
 nl 62 days ago
 Have you tried Lovable, Replit, V0 etc?
 Outside of purchasing the domain and native apps for you they cover a very significant amount of this.
 If you insist on native apps, it's possible Google Jules could do it. With Gemini 2.5 it wasn't strong enough but I think it has Gemini 3 now which can definitely do native apps just fine.
 buu700 61 days ago
 Thanks for the recommendations. Regarding your other comment, Flutter is what I've landed on as well for my next cross-platform app project, and I'm currently in the middle of developing a spec for a fairly complex agentic system that I'm going to try having Codex two-shot (basic project setup + file stubs + exhaustive tests -> manual checkpoint -> TDD the rest).
 I haven't tried Lovable, V0, or Jules, but I really like Replit for certain things. Having said that, based on my experience, I would characterize it as an amazing tool for rapid frontend iteration with prototype-level backend creation. I'm sure it's gotten better at one-shotting since I tried Agent 2 with Sonnet 3.7 in May, but would still be very (pleasantly) surprised to see that Agent 3 with current models could meet the incredibly high bar of wholly replacing a human dev team.
 The fact that tools like Replit also include their own hosting environments is definitely neat, but not really what I was getting at as far as deployment. What I had in mind was managing arbitrary cloud platforms, setting up an optimal architecture for your anticipated scale and usage patterns (whether that's a single Hetzner instance with SQLite or horizontally scaled app servers behind an API gateway with Kafka, Valkey, and Spanner or ScyllaDB), and doing all the DevOps to handle that along with things like CI/CD.
 I'm not downplaying how amazing these capabilities are. Being able to generate high-quality code from natural language feels like magic. But all the parts beyond narrow application code are half of the thing I described:
 * I'm saying you should be able to send a single off-the-cuff drunk text to an AI and later find a complete production-ready SaaS startup that fully aligns with a reasonable interpretation of your message.
 * The other half of the whole thing is >=human-level execution. If the AI can't autonomously deliver work comparable to what an experienced CTO would (given the same requirements, an arbitrarily large hiring budget, and a stipulation to never contact you again until the work was done), it's not there yet.
 Again, none of this is to dunk on agentic coding. My point is that I set an absurdly high bar because I want it to one day be met. Just as a $100 storage budget today is equivalent to $100m a few decades ago, I want to live to see a $100 engineering budget reach equivalency with last decade's $100m.
 nl 61 days ago
 If you haven't tried these things since Sonnet 4.5 came out then it's time to give them another try.
 Sonnet 4.5 and especially Codex 5.1 have completely changed the way I build software.
 > The fact that tools like Replit also include their own hosting environments is definitely neat, but not really what I was getting at as far as deployment. What I had in mind was managing arbitrary cloud platforms, setting up an optimal architecture for your anticipated scale and usage patterns (whether that's a single Hetzner instance with SQLite or horizontally scaled app servers behind an API gateway with Kafka, Valkey, and Spanner or ScyllaDB), and doing all the DevOps to handle that along with things like CI/CD.
 I think this is all possible now. But I don't think it'd work first time because there are so many environmental issues (service auth etc) that can go wrong. Maybe it'd be ok if you gave it a root AWS account...
 buu700 61 days ago
 Just in case it was unclear, I extensively use AI and agentic coding with current models on a daily basis. The only thing I haven't tried in a few months is specifically one-shotting a greenfield project.
 I know computer-use agents exist, and theoretically have tooling and permission to do all the things a human sitting in front of a computer can. I just haven't heard of anyone successfully claiming to have had one do exactly what I described for a non-toy project in one shot with zero mistakes, or of any tool like Replit claiming to support such a capability.
 I'd be very interested to know if my impression is out of date. As in, if I could send a single message to some AI service and say "Here's my credit card, banking info, and entity info/EIN; build me a production-ready Google Drive clone with religious branding and 10x higher pricing called God Drive with native Android/iOS/Linux/macOS/Windows apps, then deploy it to production on an optimal cloud architecture capable of scaling to a billion users at whatever domain name you like best and release the apps to all major app stores/repositories", then go to bed with high confidence that I'd be able to start creating God Drive docs/spreadsheets/presentations for work the following morning.
 If that isn't the case, it isn't a criticism of the technology. The fact that we're even seriously discussing the scenario is incredible.
 colechristensen 61 days ago
 Well... they're not oracles and never will be. The things I'm creating are following recognizable development practices. It's not build-once and done, it's an elaborate design/build/test cycle that happens in many flavors because unless you've already done something and are copying it, that's how you create and language models aren't going to get away from that.
 buu700 61 days ago
 Whether or not it will one day get there is anyone's guess, but it sounds like we agree that it at least isn't currently there. I brought up that goalpost to illustrate why more efficient models will only improve the aggregate volume and/or quality of output for the foreseeable future, as opposed to creating a glut of supply that destroys the economics of data centers.
 - sureglymop 62 days ago
 Given that natural language is ambiguous, what if the LLM makes some mistakes though?
 I'm wondering because it's not like it's a human that can then take accountability/responsibility for that...
 buu700 61 days ago
 I'd say the simple answer is that the buck always stops with the vendor. If Acme Co sells a jetpack that explodes and kills someone, the entity doesn't get to deflect liability by saying one of their engineers made a mistake; that may be an explanation, but not an excuse. Swapping out Acme Co and its employees with your grandma and her AI/robots doesn't change the fundamental principle.
- danielbln 62 days ago
 I think you mean computational efficiency will go _up_ in the future. To your last point: Jevons paradox might apply.
 - adityashankar 62 days ago
 Yup, that's what I meant! Jevons paradox applies to resource usage in general, not to a specific company's dominance.
 If computational efficiency goes up (thanks for the correction), and CPU inference becomes viable for most practical applications, GPUs (or accelerators) themselves may be unnecessary for most practical functions.
 - atq2119 62 days ago
 Discrete GPUs still have an advantage in memory bandwidth. Though this might push platforms like laptops towards higher bandwidths, which would be nice.
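 To put the bandwidth point in numbers, a common rule of thumb is that single-stream decoding of a dense model is bounded by how fast the weights can be read from memory once per token. The sketch below applies that bound; the model size, precision, and both bandwidth figures are hypothetical values picked for illustration, and the estimate ignores KV cache reads, compute limits, and batching.

```python
def max_decode_tokens_per_s(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound for batch-1 decoding of a dense model.

    Assumes every weight is read from memory once per generated token,
    so throughput cannot exceed bandwidth / model size.
    """
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

# Hypothetical figures, chosen only for illustration.
params = 8e9          # an 8B-parameter dense model
bytes_per_param = 2   # 16-bit weights

for label, bandwidth in (("100 GB/s memory", 100e9), ("1 TB/s memory", 1000e9)):
    bound = max_decode_tokens_per_s(params, bytes_per_param, bandwidth)
    print(f"{label}: at most {bound:.2f} tokens/s")
```

 Under this simplification a tenfold bandwidth gap translates directly into a tenfold gap in the decoding ceiling, which is why more efficient models alone do not erase the advantage of high-bandwidth memory.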
a_wild_dandan 62 days ago
If the claims in the abstract are true, then this is legitimately revolutionary. I don’t believe it. There are probably some major constraints/caveats that keep these results from generalizing. I’ll read through the paper carefully this time instead of a skim and come back with thoughts after I’ve digested it.
- jychang 62 days ago
 What's not to believe? Qwerky-32b has already done something similar as a finetune of QwQ-32b but not using traditional attention architecture.
 And hybrid models aren't new; MLA-based hybrid models are basically just Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2 actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.
 Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.
 - cubefox 62 days ago
 DeepSeek-V3.2 is a sparse attention architecture, while Zebra-Llama is a hybrid attention/SSM architecture. The outcome might be similar in some ways (close to linear complexity) but I think they are otherwise quite different.
xer 62 days ago
This is great! But what if the US invests 1% of GDP in GPU datacenters and then those are not needed because someone created a much more efficient architecture?
- wild_egg 62 days ago
 More efficiency just means more consumption. Think when they add lanes to a highway, traffic gets better for a little bit but very soon the highway is just as congested as before.
 - wilg 62 days ago
 More people get where they’re going in the same amount of time though
- dkural 62 days ago
 Look up Jevons Paradox: when something becomes more efficient, consumption can go up, often due to price elasticity.
 Think of it like this: imagine car prices go from $200,000 to $20,000. You wouldn't sell 10x the amount of cars, you'd sell... In fact I just looked up the numbers: worldwide only 100K or so cars are 200K & higher, whereas roughly 80 million cars are in that affordable category.
 So a price drop of 90% allowed sales to go from 0.1M to 80M!! I think this means we need more engines, tires, roads, gas, spare parts.
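 The arithmetic behind that example can be spelled out in a few lines. This is a sketch that uses only the rough figures quoted in the comment above and treats each price point as representative of its whole category, which is a simplification.

```python
# Figures as quoted in the comment above (rough, not independently checked).
high_price, high_units = 200_000, 100_000     # cars priced at $200K and up
low_price, low_units = 20_000, 80_000_000     # cars in the affordable category

price_drop = 1 - low_price / high_price       # fraction by which the price fell
unit_growth = low_units / high_units          # how many times more cars are sold
spend_growth = (low_price * low_units) / (high_price * high_units)

print(f"price drop: {price_drop:.0%}")
print(f"units sold grow {unit_growth:.0f}x, total spend grows {spend_growth:.0f}x")
```

 With these inputs the unit count grows far faster than the price falls, so total spending rises rather than shrinks, which is the Jevons-style outcome the comment is pointing at.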
- chpatrick 62 days ago
 Then they'll be able to use those datacenters much more efficiently.
- _boffin_ 62 days ago
 They will still use capacity. Why would you believe anything different?
Reubend 62 days ago
It would be REALLY cool to see this same technique applied to a much more recent OSS model distillation. For example, Mistral 3 14B would be a great target. How efficient can we get inference there?
AlexCoventry 62 days ago
This is from May 2025, according to the arxiv watermark. Maybe that should be mentioned in the title.
KnuthIsGod 62 days ago
Looks like the trillions of dollars spent on datacentres will end up being regretted.
- pryelluw 62 days ago
 I should have been an electrician.
mason_mpls 62 days ago
> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.
This is an extraordinary claim, is there a catch I’m missing? Am I misreading?
- jychang 62 days ago
 The catch that you're missing is that Deepseek did this ages ago.
 They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
 Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
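 For scale, the percentages quoted from the abstract and the "90%" figure in this reply can be converted into compression factors. The sketch below is plain arithmetic on the numbers that appear in the thread; the dictionary and function names are made up for illustration.

```python
# Fraction of the original KV cache that remains, as quoted from the abstract.
retained_by_variant = {"1B": 0.039, "3B": 0.02, "8B": 0.0273}

# "Reduce KV size by 90%" from the reply above means 10% remains.
retained_after_90_percent_cut = 0.10

def compression_factor(retained_fraction):
    """How many times smaller the cache is when only this fraction remains."""
    return 1.0 / retained_fraction

for variant, fraction in retained_by_variant.items():
    print(f"{variant}: {compression_factor(fraction):.1f}x smaller")
print(f"90% cut: {compression_factor(retained_after_90_percent_cut):.1f}x smaller")
```

 Taken at face value, the abstract's figures correspond to a noticeably larger reduction than a 90% cut alone, though the thread does not establish how much of that comes from MLA versus replacing attention layers with SSM layers.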
 - erichocean 62 days ago
 Kimi K2 also uses MLA, and Kimi Linear runs Kimi Delta Attention (it's SSM-like) for three out of every four layers (the fourth uses MLA).
 - jychang 62 days ago
 Kimi K2 is literally a "copy Deepseek's homework" model. Seriously. It's even exactly 61 layers, the same as Deepseek V3/R1.
 - logicprog 62 days ago
 For a "copy Deepseek's homework" model, it's really good, preferable to DeepSeek for me (at least prior to V3.2, which I haven't been able to fully put through its paces yet). Post-training really makes that much of a difference, I guess.
 - storus 62 days ago
 Linear attention is really bad, it's only good for benchmaxing but it leads to a loss of valuable granularity, which can be felt in the latest DeepSeek randomly forgetting/ignoring/correcting explicitly stated facts in the prompt.