FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore: 50% or more of the die area goes to fixed-function matrix multiplication units and their associated dedicated storage. That just isn't general purpose anymore, and FPGAs cannot rival it with their configurable DSP slices. They would need dedicated systolic blocks, which they aren't getting. The closest thing is the Versal ML tiles, and those are entire processors, not FPGA blocks. Those have failed by being impossible to program.
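To make "dedicated systolic block" concrete, here's a minimal sketch (Python, sizes made up for illustration, not any real product's design) of the dataflow such a block hardwires: weights stay resident in the array, activations stream through, and every cell does one MAC per cycle. A DSP slice can be configured into this pattern, but the routing and control that a GPU bakes into silicon has to come out of your fabric budget.

    # Minimal sketch of what a weight-stationary systolic block computes.
    # Sizes and dataflow are illustrative only, not any real product's design.
    import numpy as np

    def systolic_matmul(A, B):
        """B stays resident in the PE array; rows of A stream through it;
        each 'cycle' every PE performs one multiply-accumulate."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        C = np.zeros((M, N))
        for i in range(M):                  # row i of A streams into the array
            acc = np.zeros(N)               # one accumulator per PE column
            for k in range(K):              # one cycle per element flowing through
                acc += A[i, k] * B[k, :]    # every PE does one MAC this cycle
            C[i, :] = acc
        return C

    A, B = np.random.rand(4, 8), np.random.rand(8, 4)
    assert np.allclose(systolic_matmul(A, B), A @ B)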
> FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore.

Yeah. Even for Bitcoin mining, GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly; 2) GPUs at the time had poor binary math support, which hampered their performance, whereas an FPGA is just one giant binary math machine.
> Those have failed by being impossible to program.

I think you spoke too soon about their failure; soon they will be much easier to program [1].

Interestingly, Nvidia GPUs are also moving to a tile-based programming model that targets portability for NVIDIA Tensor Cores [2]. There have recently been discussions on the topic at HN [3].

[1] Developing a BLAS Library for the AMD AI Engine [pdf]: https://uni.tlaan.nl/thesis/msc_thesis_tristan_laan_aieblas.pdf

[2] NVIDIA CUDA Tile: https://developer.nvidia.com/cuda/tile

[3] CUDA Tile Open Sourced (103 comments): https://news.ycombinator.com/item?id=46330732
The AMD NPU and Versal ML tiles (same underlying architecture) have been a complete failure. Dynamic programming models like CUDA Tile do not work on them at all, because they require an entirely static graph to function. AMD is going to walk away from their NPU architecture and unify around their GPU IP for inference products in the future.
I think quantisation will get to the point where the GPUs that run these models look more like FPGAs than like graphics renderers.
If you quantize far enough, things begin to look more like gates than floating point units. At that level an FPGA wouldn't run your model, it would be your model.
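To make that concrete: with fully binarized weights and activations (values in {-1, +1} packed as bits), a dot product collapses to XNOR plus popcount, which maps onto LUTs rather than multipliers. A minimal illustrative sketch, not tied to any particular model:

    # 1-bit-quantized dot product as XNOR + popcount (illustrative only).
    # Encode -1 as bit 0 and +1 as bit 1; then
    #   dot(w, x) = 2 * popcount(XNOR(w, x)) - n
    def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
        matches = (~(w_bits ^ x_bits)) & ((1 << n) - 1)   # XNOR, keep low n bits
        return 2 * bin(matches).count("1") - n            # agreements minus disagreements

    # Two of the four bit positions agree and two disagree, so the dot product is 0.
    print(binary_dot(0b1011, 0b1101, 4))   # -> 0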
Turns out that a lot of interesting computation can be expressed as a matrix multiplication.
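One standard illustration (mine, not from the comment above): a 2-D convolution becomes a single matrix multiplication via the im2col trick, which is a big part of why fixed-function matmul hardware covers so much of the workload.

    # Valid 2-D convolution (cross-correlation, as in ML) as one matmul via im2col.
    import numpy as np

    def conv2d_via_matmul(x, k):
        H, W = x.shape
        kh, kw = k.shape
        oh, ow = H - kh + 1, W - kw + 1
        # Flatten every kh*kw patch into a row -> (oh*ow, kh*kw) matrix
        cols = np.array([x[i:i + kh, j:j + kw].ravel()
                         for i in range(oh) for j in range(ow)])
        return (cols @ k.ravel()).reshape(oh, ow)   # the conv is now one matmul

    x = np.arange(16.0).reshape(4, 4)
    k = np.ones((3, 3))
    print(conv2d_via_matmul(x, k))   # matches a sliding-window sum over x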
I feel like your entire comment is a self-contradicting mess.

You say FPGAs won't get dedicated logic for ML, then you say they did.

Why does it matter whether the matrix multiplication units inside the AI Engine are a systolic array or not? The multipliers support 512-bit inputs, which means a 4x8 times 8x4 multiplication per cycle for bfloat16, and bigger multiplications with smaller data types. Since it is a VLIW processor, it is much easier to achieve full utilisation of the matrix multiplication units, because you can run loads, stores and tile processing all simultaneously in the same cycle.

The only thing that might be a challenge is arranging the communication between the AI Engines, but even that should be blatantly obvious: if you are doing matrix multiplication, you should be using the entire array in exactly the pattern you think it should be using internally.

Who knows, maybe there is a way to implement flash attention like that too.
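Back-of-the-envelope on those figures (the 4x8 times 8x4 bf16 tile per cycle is from above; the clock and engine count below are illustrative assumptions, not vendor specs):

    # Rough peak-throughput sketch. Tile shape comes from the comment above;
    # clock speed and engine count are assumed round numbers, not specs.
    macs_per_cycle = 4 * 8 * 4          # an MxKxN tile = M*K*N MACs = 128
    clock_hz       = 1.0e9              # assumed ~1 GHz
    num_engines    = 400                # assumed array size

    ops_per_sec = 2 * macs_per_cycle * clock_hz * num_engines   # 2 ops per MAC
    print(f"~{ops_per_sec / 1e12:.0f} TOPS theoretical bf16 peak")   # ~102 TOPS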
The Versal stuff isn't really an FPGA anymore. Some of the chips have PL on them, but many don't. The consumer NPUs from AMD are the same Versal AIE cores with no PL. They just aren't configurable blocks in fabric anymore and don't have the same programming model. So I'm not contradicting myself here.

That being said, Versal AIE for ML has been a terrible failure. The reasons why are complicated. One is that the SRAM memory hierarchy is not a unified pool: it's partitioned into tiles and can't be accessed by all cores. Additionally, this SRAM is only accessible via DMA engines, not directly from the cores. Thirdly, the datapaths feeding the VLIW cores are statically set and require a software reconfiguration to change at runtime, which is slow. Programming this thing makes the Cell processor look like a cakewalk: you have to program the DMA engines, program hundreds of VLIW cores, and explicitly set up the on-chip network fabric. I could go on.

Anyway, my point is FPGAs aren't getting ML slices. Some FPGAs do have a completely separate thing that can do ML, but what is shipped is terrible. Hopefully that makes sense.
I don't think this is correct. For inference, the bottleneck is memory bandwidth, so if you can hook up an FPGA with better memory, it has an outside shot at beating GPUs, at least in the short term.

I mean, I have worked with FPGAs that outperformed H200s on Llama3-class models a while and a half ago.
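The bandwidth argument is easy to put numbers on: at batch size 1, every generated token streams essentially all the weights once, so memory bandwidth caps tokens/s no matter how much compute sits next to it. A rough sketch with round illustrative figures:

    # Batch-1 decode roofline: tokens/s <= memory_bandwidth / bytes_of_weights.
    # Figures below are round illustrative numbers, not measurements.
    def decode_ceiling(params_billion, bytes_per_param, bandwidth_tb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_tb_s * 1e12 / weight_bytes

    # e.g. an 8B-parameter model with 8-bit weights on ~4.8 TB/s of HBM
    print(f"~{decode_ceiling(8, 1, 4.8):.0f} tokens/s upper bound")   # ~600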
Show me a single FPGA that can outperform a B200 at matrix multiplication (or even come close) at any usable precision.

A B200 can do 10 peta-ops at fp8, theoretically.

I do agree memory bandwidth is also a problem for most FPGA setups, but Xilinx ships HBM with some SKUs and they are not competitive at inference as far as I know.
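For scale, a rough peak for fabric DSPs against that 10 peta-ops figure (slice count and clock are ballpark assumptions for a large device, not a specific part):

    # Ballpark comparison, illustrative numbers only (not a specific SKU).
    dsp_slices  = 12_000        # assumed DSP count for a large device
    clock_hz    = 700e6         # assumed achievable fabric clock
    ops_per_dsp = 2             # one MAC = multiply + add per cycle

    fpga_tops = dsp_slices * clock_hz * ops_per_dsp / 1e12   # ~17 TOPS
    b200_tops = 10_000                                       # fp8 peak cited above
    print(f"~{fpga_tops:.0f} TOPS vs ~{b200_tops} TOPS, a ~{b200_tops/fpga_tops:.0f}x gap")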
yup, GBs are so much tensor core nowadays :)