3 comments

  • genxy 45 minutes ago
    The context window is 16 *characters*. Talking about tokens per second is meaningless.
    • dominotw 12 minutes ago
      It's not meaningless. There could be use cases like spell correction.
  • cadamsdotcom 25 minutes ago
    Transformers scale poorly vs. context window size and parameter count.

    Which means it’s really impressive when those N’s are small!

    I’m but a pundit in this area so don’t know much. But one wonders if there’s a future in burning larger models to FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right with the memory it needs can speed things up.

    Likely would need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.
  • amelius 1 hour ago
    See also:

    https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-loses-to-a-single-macbook-core-by-71x/

    TL;DR: The CPU implementation was 71x faster than the FPGA.

    Note: model has only 4192 parameters.
    • hedgehog 19 minutes ago
      That post is uninteresting both because they miss the point and because it's not clear a human was even involved to perceive a point to miss. Sure, with an unlimited transistor budget, power budget, and a design clocked at 4GHz fabbed on 5nm, one of the best CPU design teams in the world can make a thing that is straight-line faster than a one-person project running at 80MHz on a 20-year-old 65nm FPGA. Any other answer would be extremely surprising.

      Now, there are a bunch of interesting things about this project. Seeing the example of a tiny transformer running on an FPGA is informative, as is the fact that it was apparently a pretty quick project for one person + robot assistance. Probably some transferable lessons for anyone else doing robo-FPGA development.

      https://github.com/fguzman82/gateGPT/tree/main/
    • cyanydeez 44 minutes ago
      Yeah, then there's prompt loading too.

      But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.
      • wmf 29 minutes ago
        That just sounds like a 3090.
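
For a rough feel of the numbers in this thread, here is a back-of-the-envelope sketch. It plugs in the figures quoted above (4192 parameters, a 16-character context, 4GHz vs. 80MHz). The function names are hypothetical, and the cost model (two FLOPs per weight per token, one attention score per pair of positions) is a common rule of thumb assumed here, not anything measured on the project.

```python
# Back-of-the-envelope scaling sketch. Function names are hypothetical and
# the cost model is an assumed rule of thumb, not a measurement of the project.

def weight_flops_per_token(n_params: int) -> int:
    """Assume one multiply and one add per weight for each token (dense model)."""
    return 2 * n_params


def attention_scores_per_head(n_ctx: int) -> int:
    """Full self-attention scores every position against every other position."""
    return n_ctx * n_ctx


def clock_ratio(fast_hz: int, slow_hz: int) -> float:
    """Ratio of two clock frequencies given in Hz."""
    return fast_hz / slow_hz


if __name__ == "__main__":
    print(weight_flops_per_token(4192))                 # 8384
    print(attention_scores_per_head(16))                # 256
    print(attention_scores_per_head(32))                # 1024, doubling context quadruples it
    print(clock_ratio(4_000_000_000, 80_000_000))       # 50.0
```

Under these assumptions the clock ratio alone is 50x, which puts the reported 71x gap in context, and the quadratic attention term is tiny at 16 positions but grows fast as the context widens.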