Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

104 points by yu3zhou4 8 hours ago

10 comments

yu3zhou4 7 hours ago
Author here. In my opinion the README is the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.
- janalsncm 1 hour ago
  Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.
GoldenJade 49 minutes ago
Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.
xuanlin314 1 hour ago
The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.
juancn 6 hours ago
Looks interesting, it reminds me of the first llama.cpp, but better documented.
nazgulsenpai 7 hours ago
I love the documentation formatted in lessons. I can't wait to read through it.
dwa3592 5 hours ago
Very nice job on the README. "Physically, LLM is a file which contains a lot of float numbers." Aka the atoms of the LLM.
- cyanydeez 5 hours ago
  the universe is just atomic if statements
cookiengineer 5 hours ago
Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/
einpoklum 5 hours ago
It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
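For readers unfamiliar with the pattern this comment refers to, here is a minimal sketch of wrapping every API call in a checking macro. It is not taken from the project. To stay compilable without the CUDA toolkit it uses hypothetical stand-ins (`status_t`, `kSuccess`, `status_string`, `fake_alloc`, `TINY_CHECK`); with the real runtime one would presumably substitute `cudaError_t`, `cudaSuccess`, and `cudaGetErrorString`.

```cpp
// Sketch only: stand-in types so this compiles without nvcc.
// Assumption: with real CUDA, status_t -> cudaError_t, kSuccess -> cudaSuccess,
// status_string -> cudaGetErrorString.
#include <sstream>
#include <stdexcept>
#include <string>

enum status_t { kSuccess = 0, kOutOfMemory = 2 };  // hypothetical stand-in codes

inline const char* status_string(status_t s) {
    return s == kSuccess ? "no error" : "out of memory";
}

// Throws with file, line, and the failing expression if the call did not succeed.
inline void check_status(status_t s, const char* expr, const char* file, int line) {
    if (s == kSuccess) return;
    std::ostringstream msg;
    msg << file << ":" << line << ": " << expr << " failed: " << status_string(s);
    throw std::runtime_error(msg.str());
}

// Wrap each API call so a failure cannot be silently ignored.
#define TINY_CHECK(call) check_status((call), #call, __FILE__, __LINE__)

// Hypothetical API call standing in for something like an allocation.
inline status_t fake_alloc(bool succeed) {
    return succeed ? kSuccess : kOutOfMemory;
}

// Returns the error message of a checked call, or an empty string on success.
inline std::string checked_call_message(bool succeed) {
    try {
        TINY_CHECK(fake_alloc(succeed));
        return "";
    } catch (const std::runtime_error& e) {
        return e.what();
    }
}
```

The cost is one macro and one helper, and each call site grows only by the wrapper name, which is why such a check is usually considered compatible with a small codebase.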