KVarN: Native vLLM backend for KV-cache quantization by Huawei

(github.com)

112 points by theanonymousone 9 hours ago

4 comments

throwa356262 8 hours ago
Better performance than TQ and better quality than FP16? Am I reading this right??
- qeternity 7 hours ago
  It's not better quality: 59.3% vs. 59.4% for FP16 on AIME 25.
- electroglyph 2 hours ago
  any divergence (even if the benchmark is better) from full precision is error
- thefox96 7 hours ago
  Faster than FP16, not better quality, I guess.
v3ss0n 8 hours ago
Why is this not a PR for vLLM?
- electronsoup 1 hour ago
  Why is this not a PR for llama.cpp?
- woadwarrior01 4 hours ago
  Last I heard, vLLM was backed by a company that has raised $150m in seed funding. I'm sure they've got the resources to port it.
- esafak 8 hours ago
  It's the output of a research paper; the authors are not trying to build up vLLM, and they probably have no incentive to do so. You can submit a PR, though! It's easier now while the divergence is low, so don't wait. Since there are six authors, I bet you could get help with the inevitable review chores if you just take the step of creating the PR.
  edit: It might not be clear that it is based on vLLM 0.22, which is the current version: https://github.com/huawei-csl/KVarN/commit/d6290e99098d7426dcce01cdd8cc57a2eecf21a0 . All you have to do is create a diff off it; it's fairly straightforward.
  - jmalicki 8 hours ago
    And with the help of AI, pointing an AI at this paper and saying "make a vLLM PR from this paper" tends to work surprisingly well, even if you need to nudge it a little along the way.
- thefox96 7 hours ago
  it should be easy to do btw
0xjeffro 2 hours ago
"Far ahead" (yao yao ling xian)