Hi Intel, I'm itching to buy an Xe3P! Or Nova Lake? Crescent Island? Celestial? Jaguar Shores?<p>Whatever the hell you name it doesn't matter to me, I just want a workstation with one of them bad boys attached to 160GB of RAM for legit inference power!<p>I've been saving my money by not paying for Claude Code so I can run my own agentic coding setup at home on yours. Please don't charge too much for the workstation-class card if you can at all manage it. Maybe give us a discount to preorder? Please don't price a regular consumer like me out of the market!<p>Also, I'm speculating that integer-based models will become hot due to lower memory and power requirements. Will the Xe3P be able to do integer-based math inference to use all that RAM to even greater effect?
Time to first token is a very important performance metric, as I found out using a Mac Studio M3 Ultra (which is quite slow in this respect).<p>But 32GB at a TDP of 230W is perhaps not super interesting, especially because you probably want more than one card. That's a lot of heat. You could use the cards to heat a building, but heat pumps exist.
The Intel Arc B70, when released, produced only 1/3 the tokens per second of the RTX PRO 4500. Then again, it also costs 1/3 as much as the RTX PRO 4500.<p>It lacked software support for its primary target application: running LLMs. The officially supported vLLM fork is 6 versions behind mainline. It did not run the latest hot new open models on Hugging Face. Running two B70s in parallel reduced the token rate rather than improving it. So the software behind the B70 is simply far behind.
There's a tradeoff between dense models and MoEs in memory usage vs. compute for the same quality.<p>For example, Qwen3.5 27B and Qwen3.5 122B A10B have similar average performance across benchmarks. The 122B is much faster to run than the 27B (it generates more tokens at the same compute). The 27B, on the other hand, uses ~4x less VRAM at low context lengths (with less of a difference at high context lengths).<p>Right now, different hardware seems suited to different points on the dense vs. MoE spectrum. On one extreme is hardware like the DGX Spark and Strix Halo, which have a lot of memory relative to their compute performance and memory bandwidth, and are best suited for MoE workflows. On the other extreme you have cards like the RTX 5090, which have very high performance for the price but rather little memory, and are best suited for dense models.<p>The Arc Pro B70 seems to sit in the awkward middle. With 1-2 of these, you can run a ~30B dense model slowly, probably not fast enough to be useful interactively (you'd probably need a 5090 or 2x 3090 for that). Or you can run a MoE model at high throughput, but probably without enough quality to support agentic workflows that actually use that throughput.
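The tradeoff above can be sketched with rough numbers. This is only a back-of-envelope sketch: per-token decode work scales with *active* parameters, while VRAM footprint scales with *total* parameters, and the bytes-per-parameter figure below is an assumed ~6-bit quant, not a measured value for any specific model file.

```python
# Rough dense vs. MoE tradeoff: decode work tracks active params,
# VRAM tracks total params. bytes_per_param is an assumption (~6-bit quant).
def vram_gib(total_b_params: float, bytes_per_param: float = 0.75) -> float:
    return total_b_params * 1e9 * bytes_per_param / 1024**3

def decode_speedup(dense_active_b: float, moe_active_b: float) -> float:
    return dense_active_b / moe_active_b

# Dense 27B vs. a 122B MoE with 10B active parameters:
print(round(decode_speedup(27, 10), 1))        # MoE decodes ~2.7x faster
print(round(vram_gib(122) / vram_gib(27), 1))  # ...but needs ~4.5x the VRAM
```

The ~4.5x VRAM ratio lines up with the "~4x less VRAM at low context lengths" figure above; at high context lengths the KV cache narrows the gap.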
LLMs are memory-bandwidth bound, not compute bound.
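A back-of-envelope sketch of why, for token generation specifically: each generated token has to stream roughly the whole weight set from memory once, so decode speed tops out near bandwidth divided by model size. The bandwidth and model-size figures below are illustrative assumptions, not specs for any particular card.

```python
# Decode ceiling: each token streams (roughly) all weights from memory once,
# so tokens/s <= bandwidth / model size. Numbers are illustrative, not specs.
def decode_ceiling_tok_s(mem_bw_gb_s: float, model_size_gib: float) -> float:
    bytes_per_token = model_size_gib * 1024**3
    return mem_bw_gb_s * 1e9 / bytes_per_token

# e.g. a card with ~450 GB/s running an 11.27 GiB model:
print(round(decode_ceiling_tok_s(450, 11.27), 1))  # ~37 tok/s upper bound
```

Prefill is the opposite case: prompt tokens are processed in large batches, so the weights are reused across many tokens and compute becomes the limit, which is why pp and tg numbers diverge so sharply in the benchmarks below.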
DGX Spark is at the compute level of 5070. Its main issue is low memory bandwidth, i.e. it has quite fast token prefill but awful token generation. Strix Halo is just slow on every metric and used to be a cheap way to get 128GB unified RAM (now its prices are comparable to DGX Spark).
I'm working mostly with image models, so we have a lot of fun with it, and the card fits perfectly here. Performance isn't great, but it can just tug along in the background.
I still don't see the point of running these models. They produce plausible garbage, nowhere near the quality of frontier models (when those work).<p>Why can't Intel look beyond this nonsensical state of affairs and build something with 1TB of RAM or more?<p>What I'm trying to say is that I have yet to see anything competitive in the market. Cards are very much stalled in the sub-100GB region, and the best these corporations can do is throw out something that runs toy models and forget about it after a week.
Here are some llama.cpp benchmarks for it: <a href="https://www.phoronix.com/review/intel-arc-pro-b70-linux/3" rel="nofollow">https://www.phoronix.com/review/intel-arc-pro-b70-linux/3</a>
Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32G. The Phoronix numbers look extremely low; maybe I'm misunderstanding their test bench. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.<p>I tried to use the same model as the article:<p>llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128<p>AMD R9700 pp2048=3867 tg128=175<p>And a bigger model, because testing a tiny model with a 32GB card feels like a waste:<p>llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128<p>AMD R9700 pp2048=917 tg128=22
As of b8966, it is still not great.<p><pre><code> | model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 851.81 ± 6.50 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 42.05 ± 1.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.28 ± 4.82 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.15 ± 0.23 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | pp2048 | 299.93 ± 0.40 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | SYCL | 999 | tg128 | 14.58 ± 0.06 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | pp2048 | 581.99 ± 0.86 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | Vulkan | 999 | tg128 | 10.64 ± 0.12 |
</code></pre>
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same:<p><pre><code> | model | size | params | backend | ngl | test | t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | -------------: |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | pp2048 | 854.16 ± 6.06 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | SYCL | 999 | tg128 | 44.02 ± 0.05 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | pp2048 | 2022.24 ± 6.97 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 999 | tg128 | 114.02 ± 0.13 |
</code></pre>
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...
"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?<p>In that particular model family, the choices are 20B and 120B, so 20B higher quant fits in VRAM, while you'd be settling for 120B at a lower quant. Is it that 20B MXFP4 is comparable in performance so no need for Q8?<p>Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?
For reference in case it's interesting to someone, a 5090 on Windows 11 with CUDA 13.1<p><pre><code> | model | size | params | backend | ngl | test | t/s |
| --------------------- | ---------: |--------: | -------- | --: |------: |----------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | pp2048 | 10179.12 ± 52.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 999 | tg128 | 326.82 ± 7.82 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | pp2048 | 3129.92 ± 5.12 |
| qwen35 27B Q6_K | 23.87 GiB | 26.90 B | CUDA | 999 | tg128 | 53.45 ± 0.15 |
build: 9d34231bb (8929)
gpt-oss-20b-MXFP4.gguf
Qwen3.6-27B-UD-Q6_K_XL.gguf
</code></pre>
Using MXFP4 of GPT-OSS because it was trained quantization-aware for this quantization type, and it's native to the 50xx.
The build they use is from February, over two months old: <a href="https://github.com/ggml-org/llama.cpp/releases/tag/b8121" rel="nofollow">https://github.com/ggml-org/llama.cpp/releases/tag/b8121</a><p>That might not sound like much, but two months in LLM time is a long time, especially regarding support for new hardware like the R9700.
Also from Phoronix, a comparison with the AMD R9700 and RTX 6000 Ada (because NVIDIA has not sent them a Blackwell card): <a href="https://www.phoronix.com/review/intel-arc-pro-b70/2" rel="nofollow">https://www.phoronix.com/review/intel-arc-pro-b70/2</a>
For those who use Blender, from their section about Blender:<p>> We hope that, in the future, there will be real options other than NVIDIA for GPU-based rendering, as it is an area where competition is nearly non-existent.<p>And checking opendata.blender.org, an NVIDIA GeForce RTX 4080 Laptop GPU scores 5301.8, while the Intel Arc Pro B70 is still at 3824.64.<p>So there is still a ways to go before Intel GPUs perform close to NVIDIA's.
Also the first section I jumped to :) To Intel's credit, seems they're slowly improving, the section starts with:<p>> Over the last year or two, Intel has worked to deliver serious optimizations for and compatibility with Blender GPU rendering on its Arc GPUs. Although NVIDIA has long held an advantage in the application, our last time looking at Intel’s cards indicated ongoing improvements. This round of testing is no different. We found that the Arc Pro B70 provided more than twice the performance of the B50, also beating the R9700 by 9%.
Is this because Blender is in fact using CUDA?
Blender supports CUDA, HIP, OneAPI, and Metal. So Intel GPUs are performing poorly using their native API.
The key feature on Intel platforms is the hardware denoise acceleration (NVIDIA OptiX also works well). Note that AMD OpenCL works quite well for some renders, but Blender Flamenco likes consistent cluster hardware.<p>For 8K HDR10 media or 3+ screens, the RTX 5090 32G model is going to be the minimum card people should buy. Just because you see 4 DP ports doesn't mean the card can push the bit rates needed to drive an HDR10 display at >60Hz.<p>The Mac Studio with >512GB unified RAM/VRAM is a better LLM lab solution (Apple recently nerfed it to 256GB). Who cares if a task completes a bit slower; it doesn't matter given the lower error rates... and it doesn't cost $14k like an RTX 6000. =3<p>Great tutorial on getting Blender to behave on mid-grade PCs and laptops etc.:<p><a href="https://www.youtube.com/watch?v=a0GW8Na5CIE" rel="nofollow">https://www.youtube.com/watch?v=a0GW8Na5CIE</a>
I was looking into this for LLMs but it's clearly a graphics-processing focused card. The memory bandwidth is too low for that much RAM to be useful in an LLM context. The 5090 I have has the same amount of RAM but far more bandwidth and that makes it much more useful.
Compared to a B70, a 5090 is 1x the memory with 3x the bandwidth at 4x the price. Yeah, the 5090 is better, but you're paying for it.
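Taking the rough multiples in the comment above at face value (they are ballpark ratios, not spec-sheet or street-price figures), and given that decode speed tracks memory bandwidth, the bandwidth-per-dollar arithmetic actually favors the B70:

```python
# Bandwidth per dollar, using the rough multiples quoted above
# (assumed ratios, not official specs or prices; B70 = 1.0 baseline).
def bandwidth_per_dollar_ratio(bw_multiple: float, price_multiple: float) -> float:
    """How the 5090's bandwidth/$ compares to the B70's."""
    return bw_multiple / price_multiple

print(bandwidth_per_dollar_ratio(3, 4))  # 0.75: less bandwidth per dollar than the B70
```

So the 5090 wins on absolute decode speed per card, but not on decode speed per dollar spent, which is the point being made.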
Oh wow, I really would've expected higher memory bandwidth. That's only ~2-3x the little DGX Spark-alike I have to play with.
It's 32GB for people who can't go for scalped 5090s but have a 3090 budget.<p>I have a pair of them with a 9480, and the only thing I have to do is keep the cache happy.
It is weird that the reviewer does not mention the RTX PRO 6000 96GB but does mention the RTX PRO 5000 72GB. The 72GB RTX PRO 5000 is a special order, and far fewer people are aware of it; the RTX PRO 6000 is known by practically everyone in the LLM world.<p>I cannot understand why a tech reviewer would do that.
It should be possible to use the VRAM as extra swap space, when you're not using it for AI or gaming or anything else. 32GB is already more than a lot of computers have as just regular RAM, even sufficient to hold an OS installation:<p><a href="https://www.tomshardware.com/news/lightweight-windows-11-runs-entirely-in-gpus-vram" rel="nofollow">https://www.tomshardware.com/news/lightweight-windows-11-run...</a>
How should I update my simplistic understanding that decode is bandwidth-bound, given these results showing the B70 decoding faster than a 4090 (which has about 50% more bandwidth)?
I'd like one for the VRAM, but I'm sure they'll be unobtainable after the initial stock sells out, as I assume they were produced before RAM prices went up.
Is Intel still making GPUs? I have heard so many conflicting things about will they/won't they stay in the market.
They appear to be backing out (for a little while) of consumer cards, but datacentre/workstation/laptop GPUs are still their focus.
Intel has always had a habit of starting internal conflicts whenever a potential alternative revenue source starts to threaten its internal dependence on x86.
What do you mean, are they still making GPUs? This is a discrete GPU that has just recently been released, and it's one of the most popular GPUs in its class at the moment, due to 32 GiB of RAM for under $1000, which makes it great for LLM inference.
I thought I had read that too and went to look for clarification and found that they’re just moving to a single architecture for their cards. Seemed reasonable.
The B70 would have been the B770, but it was canceled. Celestial has been canceled too.
They'll always have iGPUs, so whether they stay in the dGPU market depends mostly on whether people buy them. They might not; the whole market seems to be moving to SoCs/APUs/whatever you want to call them.
Not only will they always have iGPUs, but also cannot give up on advancing their datacenter AI GPUs (the next being Jaguar Shores). They need both of those far more than consumer or prosumer dGPUs, but that means they are committed to Big GPU work and Small GPU work.<p>Since they will have both of those big and small "bookends" of GPU architectures, it is a question of whether they see benefits in maintaining an accessible foothold in the midmarket ecosystem. I could make an argument for both sides of that, but obviously the decision is not up to me.
I don't know what to believe when it comes to Intel news because they have so many haters.
Can you use these AI cards for gaming too?<p>Or do the makers intentionally nerf them, in order to better segment the markets/product lines?
The drivers often need per-game optimisations, which these will be missing, but I doubt Intel would nerf them; they just rely on you not wanting to pay a lot for RAM the game won't use.
They nerf gaming cards to make money on the pro cards. Since this is a pro card it's not nerfed.
These seem amazing for hobbyists, but that TDP, given the performance, might be an issue when deploying a lot of them.
$950 for 23 TF of FP32? Has GPU performance grown at all in the past 5-10 years?
This review was essentially pointless: they reviewed the card for a ton of workloads nobody in their right mind would pick it for, and left out the only use case where it makes sense. Great job?
It looks like, if one can afford it, the R9700 is worth the extra money.<p>I read that Intel is getting out of the dGPU space, but then again, their iGPUs are really getting good. I can't understand why they'd give up the space when the AI market is so insane.
Rumors of their exit from dGPUs predate Battlemage, so I wouldn't put a ton of credence in them. But Intel is quite talented at snatching defeat from the jaws of victory.
I hope not. They’ve been flip flopping too much and the market needs more dGPU competition.<p>The team working on drivers is doing a good job playing catch up and I hope intel will continue to invest in cards that focus on graphics workloads and not just on AI inference.
Why are they still using their old Xe2/Battlemage architecture rather than their new Xe3/Celestial? They already used it in their Panther Lake chips.
That's coming out in <a href="https://www.phoronix.com/review/intel-crescent-island" rel="nofollow">https://www.phoronix.com/review/intel-crescent-island</a> by around the end of the year.
It looks like B70 was delayed 1-2 years for some reason.
From what I've read, the Intel drivers are terrible and are holding back their use for LLMs.
I don't think that's true. The drivers are bad (not sure "terrible" is fair; they have improved a lot), especially for older DirectX games and the like. But Vulkan support is pretty good, and that's really all you need for LLMs.
I don't know about LLMs, but I tried an Intel card when Ubuntu Wayland couldn't initialize a 2 year old Nvidia. It just works.
That is just Linux and politics. Linux wants to force vendors to open-source their drivers; Intel plays along, while NVIDIA, as the market leader, does not, so you have to use their proprietary driver, which most distros do not ship by default.
Interesting. I had read that Intel's Linux drivers were far behind their Windows versions. I haven't checked in a few months though.
Everyone has terrible drivers here aside from Nvidia.<p>Intel looks like they'll leave the dedicated GPU space, so it's a bit doubtful if the drivers will ever catch up.