Paper: https://arxiv.org/abs/2605.12825 ; Code+models: https://github.com/chiennv2000/orthrus ; Disclosure: co-author.

Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. The diffusion head drafts K=32 tokens in parallel; the AR head verifies them in a second pass and accepts the longest matching prefix. The output distribution is provably identical to the base model's.

Results:

- Up to 7.8x TPF, ~6x wall-clock on MATH-500.

- 16% of params trained, <1B tokens, 24h on 8xH200.

- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify the base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.

- vs. speculative decoding (EAGLE-3, DFlash): no external drafter, no separate cache, zero TTFT penalty (no drafter to init/sync). KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).

- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.

Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
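To make the "accepts the longest matching prefix" part concrete, here's a rough sketch of the verification step in the greedy case. This is not the repo's actual code; `draft_tokens` and `ar_logits` are placeholders for the diffusion head's K proposals and the frozen AR head's logits at those same positions from the single verification pass:

    import torch

    def accept_longest_prefix(draft_tokens, ar_logits):
        # draft_tokens: (K,) token ids proposed by the diffusion head
        # ar_logits:    (K, vocab) frozen AR head's logits at those positions
        ar_tokens = ar_logits.argmax(dim=-1)   # what greedy AR decoding would emit
        matches = draft_tokens == ar_tokens    # position-wise agreement
        if matches.all():
            return draft_tokens                # whole drafted block accepted
        first_miss = int((~matches).float().argmax().item())
        # keep the matching prefix, plus the AR head's own token at the mismatch,
        # so every verification pass still emits at least one token
        return torch.cat([draft_tokens[:first_miss],
                          ar_tokens[first_miss:first_miss + 1]])

The outer loop appends the accepted tokens, reuses the shared KV cache, and drafts the next K tokens. Because the fallback token at the first mismatch is the AR head's own greedy choice, the final output is exactly what plain AR decoding would have produced.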
Do you plan on releasing the training code?
Amazing. Is it possible to do this with Qwen 3.6 27B? Will it work with quants (I assume so)?
From a quick and shallow read of the paper, it looks very feasible (with a little tinkering) to adapt to Qwen 3.6 27B. The process looks somewhat similar to training a LoRA, or in a way distilling your own model so that a mini model learns to imitate it, and then gluing them together. I might bite the bullet and rent a GPU to do it for 3.6 27B, as this would solve a lot of my problems.
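If I'm reading the training recipe right, the trainable part is basically self-distillation of the frozen model into the injected head. Very rough sketch (not the paper's code; `draft_logits` and `teacher_logits` are placeholders for the injected head's and the frozen backbone's outputs on the same batch):

    import torch
    import torch.nn.functional as F

    def distill_loss(draft_logits, teacher_logits):
        # KL(teacher || draft), averaged over drafted positions; only the
        # injected head gets gradients, the frozen backbone is detached
        log_p_draft = F.log_softmax(draft_logits, dim=-1)
        p_teacher = F.softmax(teacher_logits.detach(), dim=-1)
        return F.kl_div(log_p_draft, p_teacher, reduction="batchmean")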
Scratch that, I don't have that kind of money, and 3.6's architecture diverges a bit more from 3's, so it will be less trivial. It does look possible, just not on a student's paycheck.
So it's D-Flash, but at each transformer layer and sharing the KV cache of the original model? Very smart!