Ornith-1.0: Self-scaffolding LLMs for agentic coding

(deep-reinforce.com)

47 points by kordlessagain 1 day ago

3 comments

SwellJoe 1 day ago
I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performs poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use. https://swelljoe.com/post/will-it-mythos/
- kordlessagain 18 hours ago
 Good to know. Thanks for the research!
Balinares 11 hours ago
I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.
- juliangoldsmith 5 hours ago
 It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed that until I looked closely. In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it were fast and cheap enough it might. I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.
- chid 10 hours ago
 I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.
nzach 11 hours ago
Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question? If that is the case, isn't this just a fancy way to perform prompt optimization?