GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

(twitter.com)

22 points by laxmena 1 hour ago

3 comments

genxy 45 minutes ago
The context window is 16 characters. Talking about tokens per second is meaningless.
- dominotw 12 minutes ago
 It's not meaningless. There could be use cases like spell correction.
cadamsdotcom 25 minutes ago
Transformers scale poorly vs. context window size and parameter count. Which means it’s really impressive when those N’s are small! I’m but a pundit in this area so don’t know much. But one wonders if there’s a future in burning larger models to FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right with the memory it needs can speed things up. Likely would need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.
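(Editorial sketch, not part of the original comment: a rough back-of-the-envelope model of the scaling claim above. It assumes a standard decoder block with four attention projections and a 4x-wide MLP, and it ignores embeddings, biases, and layer norms. The function names and the layer/width values are hypothetical and are not taken from GateGPT; only the 16-position context comes from the thread.)

```python
def block_params(n_layer: int, d_model: int) -> int:
    """Approximate weight count of the transformer blocks.

    Per layer: 4*d^2 for the Q, K, V, O projections plus 8*d^2 for a
    4x-wide two-matrix MLP. Embeddings, biases and norms are ignored.
    """
    return 12 * n_layer * d_model * d_model


def decode_flops_per_token(n_layer: int, d_model: int, context: int) -> int:
    """Approximate FLOPs to generate one token when a KV cache is used."""
    # Each weight takes part in one multiply-add per token: 2 FLOPs per weight.
    weight_flops = 2 * block_params(n_layer, d_model)
    # The new query attends over `context` cached positions:
    # 2*d*context for the QK^T scores plus 2*d*context for the weighted sum of V.
    attention_flops = 4 * n_layer * d_model * context
    return weight_flops + attention_flops


def prefill_attention_flops(n_layer: int, d_model: int, context: int) -> int:
    """Approximate attention FLOPs to process a whole prompt at once.

    Every position attends to every position, hence the quadratic term.
    """
    return 4 * n_layer * d_model * context * context


if __name__ == "__main__":
    # Hypothetical tiny model: 1 layer, width 16.
    for context in (16, 32, 64):
        print(
            context,
            decode_flops_per_token(1, 16, context),
            prefill_attention_flops(1, 16, context),
        )
```

Under these assumptions, doubling the width quadruples the weight count, and doubling the context quadruples the prompt-processing attention cost, while the per-token decode cost with a KV cache grows only linearly in context.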
amelius 1 hour ago
See also: https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-loses-to-a-single-macbook-core-by-71x/ TL;DR: The CPU implementation was 71x faster than the FPGA. Note: model has only 4192 parameters.
- hedgehog 19 minutes ago
 That post is uninteresting both because it misses the point and because it's not clear a human was even involved to perceive a point to miss. Sure, with an unlimited transistor budget, power budget, and a design clocked at 4GHz fabbed on 5nm, one of the best CPU design teams in the world can make a thing that is straight-line faster than a one-person project running at 80MHz on a 20-year-old 65nm FPGA. Any other answer would be extremely surprising. Now, there are a bunch of interesting things about this project. Seeing the example of a tiny transformer running on FPGA is informative, as is the fact that it was apparently a pretty quick project for one person + robot assistance. Probably some transferable lessons for anyone else doing robo-FPGA development. https://github.com/fguzman82/gateGPT/tree/main/
- cyanydeez 44 minutes ago
 Yeah, then there's prompt loading too. But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.
 - wmf 29 minutes ago
 That just sounds like a 3090.