3 comments

  • SwellJoe 1 day ago
    I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performs poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.

    https://swelljoe.com/post/will-it-mythos/
  • Balinares 11 hours ago
    I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a *bonkers* claim.
    • juliangoldsmith 5 hours ago
      It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed that until I looked closely.

      In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it were fast and cheap enough it might.

      I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.
    • chid 10 hours ago
      I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.
  • nzach 11 hours ago
    So instead of training the model to answer questions directly, they trained it to always write and execute code that solves the question?

    If that is the case, isn't this just a fancy way to perform prompt optimization?