iPhone 17 Pro Demonstrated Running a 400B LLM

(twitter.com)

90 points by anemll1 hour ago

10 comments

causal58 minutes ago
Run an incredible 400B parameters on a handheld device.0.6 t/s, wait 30 seconds to see what these billions of calculations get us:"That is a profound observation, and you are absolutely right ..."
- intrasight31 minutes ago
 Better than waiting 7.5 million years to have a tell you the answer is 42.
 - thinkingtoilet14 minutes ago
 Maybe you should have asked a better question. :P
 - patapong0 minutes ago
 What do you get if you multiply six by nine?
- WarmWash36 minutes ago
 I don't think we are ever going to win this. The general population loves being glazed way too much.
 - baal80spam31 minutes ago
 > The general population loves being glazed way too much.This is 100% correct!
 - WarmWash19 minutes ago
 Thanks for short warm blast of dopamine, no one else ever seems to grasp how smart I truly am!
 - timcobb11 minutes ago
 That is an excellent observation.
 - tombert10 minutes ago
 That's an astute point, and you're right to point it out.
 - actusual9 minutes ago
 You are thinking about this exactly the right way.
- amelius4 minutes ago
 I mean size says nothing, you could do it on a Pi Zero with sufficient storage attached.So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
- Aurornis21 minutes ago
 I thought you were being sarcastic until I watched the video and saw those words slowly appear.Emphasis on slowly.
firstbabylonian1 hour ago
> SSD streaming to GPUIs this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?1: <a href="https://arxiv.org/abs/2312.11514" rel="nofollow">https://arxiv.org/abs/2312.11514</a>
- simonw1 hour ago
 Yes. I collected some details here: <a href="https://simonwillison.net/2026/Mar/18/llm-in-a-flash/" rel="nofollow">https://simonwillison.net/2026/Mar/18/llm-in-a-flash/</a>
- foobiekr16 minutes ago
 This is not entirely dissimilar to what Cerebus does with their weights streaming.
 - manmal3 minutes ago
 And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?
- zozbot23455 minutes ago
 A similar approach was recently featured here: <a href="https://news.ycombinator.com/item?id=47476422">https://news.ycombinator.com/item?id=47476422</a> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable to a mobile device.)
 - Aurornis22 minutes ago
 > Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
 - simonw45 minutes ago
 Yeah, this new post is a continuation of that work.
cj001 hour ago
It’s 400B but it’s mixture of experts so how many are active at any time?
- simonw1 hour ago
 Looks like it's Qwen3.5-397B-A17B so 17B active. <a href="https://github.com/Anemll/flash-moe/tree/iOS-App" rel="nofollow">https://github.com/Anemll/flash-moe/tree/iOS-App</a>
- anshumankmr3 minutes ago
 Aren't most companies doing MoE at this point?
_air40 minutes ago
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
- originalvichy18 minutes ago
 On smartphones? It’s not worth it to run a model this size on a device like this. A smaller fine-tuned model for specific use cases is not only faster, but possibly more accurate when tuned to specific use cases. All those gigs of unnecessary knowledge are useless to perform tasks usually done on smartphones.
- Tade021 minutes ago
 Only way to have hardware reach this sort of efficiency is to embed the model in hardware.This exists[0], but the chip in question is physically large and won't fit on a phone.[0] <a href="https://www.anuragk.com/blog/posts/Taalas.html" rel="nofollow">https://www.anuragk.com/blog/posts/Taalas.html</a>
 - intrasight1 minute ago
 I think for many reasons this will become the dominant paradigm for end user devices.Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
ashwinnair991 hour ago
A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
- cogman101 hour ago
 This isn't a hardware feat, this is a software triumph.They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
 - pdpi1 hour ago
 It's both.We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
 - bigyabai1 minute ago
 > We haven't had phones running laptop-grade CPUs/GPUs for that longAgree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
 - smallerize57 minutes ago
 The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
- Aurornis18 minutes ago
 It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
 - zozbot2340 minutes ago
 If the bottleneck is storage bandwidth that's not "slow". It's only slow if you insist on interactive speeds, but the point of this is that you can run cheap inference in bulk on very low-end hardware.
- mannyv42 minutes ago
 The software has real software engineers working on it instead of researchers.Remember when people were arguing about whether to use mmap? What a ridiculous argument.At some point someone will figure out how to tile the weights and the memory requirements will drop again.
 - snovv_crash31 minutes ago
 The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
pier2550 minutes ago
<a href="https://xcancel.com/anemll/status/2035901335984611412" rel="nofollow">https://xcancel.com/anemll/status/2035901335984611412</a>
- dang8 minutes ago
 Added to toptext. Thanks!
jee5991 hour ago
[dead]
anemll1 hour ago
[flagged]
- lostmsu1 hour ago
 This has nothing to do with Apple, and everything to do with MoE and that everyone forgot you can re-read the necessary bits of the model from disk for each token.This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
rwaksmunski1 hour ago
Apple might just win the AI race without even running in it. It's all about the distribution.
- dzikimarian36 minutes ago
 Because someone managed to run LLM on an iPhone at unusable speed Apple won AI race? Yeah, sure.
 - naikrovek31 minutes ago
 whoa, save some disbelief for later, don't show it all at once.
- raw_anon_11111 hour ago
 Apple is already one of the winners of the AI race. It’s making much more profit (ie it ain’t losing money) on AI off of ChatGPT, Claude, Grok (you would be surprised at how many incels pay to make AI generated porn videos) subscriptions through the App Store.It’s only paying Google $1 billion a year for access to Gemini for Siri
 - detourdog54 minutes ago
 Apple’s entire yearly capex is a fraction of the AI spend of the persumed AI winners.
 - foobiekr18 minutes ago
 Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature.Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
 - devmor35 minutes ago
 Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.Apple’s bet is intelligent, the “presumed winners” are hedging our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
 - qingcharles36 minutes ago
 Plus all those pricey 512GB Mac Studios they are selling to YouTubers.
 - icedchai12 minutes ago
 They don't offer the 512 gig RAM variant anymore. Outside of social media influencers and the occasional AI researcher, the market for $10K desktops is vanishingly small.
 - Multiplayer0 minutes ago
 My understanding is that the 512gb offering will likely return with the new M5 Ultra coming around WWDC in June. Fingers crossed anyway!
simopa1 hour ago
It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.
- volemo5 minutes ago
 > moving forward, as the information density and architectural efficiency of smaller models continue to increaseIf they continue to increase.