13 comments

  • sacrelege 6 hours ago
    Thanks for putting N-Day-Bench together - really interesting benchmark design and results.

    I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.
    • Tepix 1 hour ago
      Interesting. How fast is your service? Do you guarantee a certain number of tokens/s?
    • ra 4 hours ago
      Nice. I've been thinking of doing something similar in our local jurisdiction (Australia).

      Are you able to share (or point me toward) any high-level details (key hardware, hosting stack, high-level economics, key challenges)?

      I'd love to offer to buy you a coffee but I won't be in Switzerland any time soon.
  • linzhangrun 8 hours ago
    Definitely possible. In January, I tried using Gemini to perform black-box/white-box testing on an existing system in my company (it's quite old). It successfully exploited a hidden SQL injection vulnerability to penetrate the system and extract password hashes (the passwords weren't particularly strong, and the hashes were cracked on a public lookup site). In terms of pure skill level, I'd say this is at least the level of a mid-level cybersecurity professional, not even counting the significant efficiency improvement.
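    The kind of hidden injection described above is almost always the classic concatenated-query pattern. A minimal, hypothetical sketch of the vulnerable shape versus the fix (invented schema, not the actual system):

    ```python
    import sqlite3

    # Hypothetical schema, just to illustrate the vulnerability class.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, pw_hash TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'abc123')")

    def login_vulnerable(name: str):
        # Attacker-controlled input is concatenated straight into the SQL text.
        query = f"SELECT pw_hash FROM users WHERE name = '{name}'"
        return conn.execute(query).fetchall()

    def login_safe(name: str):
        # Parameterized query: the driver treats the input as data, not SQL.
        return conn.execute(
            "SELECT pw_hash FROM users WHERE name = ?", (name,)
        ).fetchall()

    # The classic payload turns the WHERE clause into a tautology:
    assert login_vulnerable("' OR '1'='1") == [("abc123",)]  # every hash leaks
    assert login_safe("' OR '1'='1") == []                   # payload is inert
    ```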
  • Cynddl 10 hours ago
    > Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

    Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?

    It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.
    • yorwba 1 hour ago
      Yeah, the LLM judge is a bit too gullible. GLM 5.1 here (https://ndaybench.winfunc.com/traces/trace_585887808ff443cca3b5773edb4286f8) claims that onnx/checker.cc doesn't reject hardlinks, even though it does (and the model output even quotes the lines that perform the check). The actual patch (https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8fec71b69c74f786cb) instead adds std::filesystem::weakly_canonical to catch path traversal through symlinks. It also adds a Python function that does the same (?) checks when saving files. Honestly, even that patch seems LLM-generated to me, the way it duplicates code in a bunch of places instead of channeling all file accesses through a single hardened function.

      Anyway, GLM 5.1 gets a score of 93 for its incorrect report.
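      For anyone unfamiliar with the canonicalize-then-compare defense that patch uses, a rough Python sketch of the same idea (my own illustration, not the patch's code):

      ```python
      import os

      def stays_inside(base_dir: str, user_path: str) -> bool:
          """Reject path traversal: resolve symlinks and '..' segments
          before checking containment, roughly what weakly_canonical
          achieves on the C++ side."""
          base = os.path.realpath(base_dir)
          target = os.path.realpath(os.path.join(base_dir, user_path))
          return os.path.commonpath([base, target]) == base

      # A naive prefix check on the raw string would miss both of these:
      assert stays_inside("/tmp", "model.onnx")
      assert not stays_inside("/tmp", "../etc/passwd")
      ```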
    • rohansood15 9 hours ago
      I worked in AppSec in the past, made sense to me. Maybe you aren't the target audience?

      You don't really need manual verification for these, the CVEs (vulnerabilities) are public and can be programmatically validated.
      • muldvarp 4 hours ago
        Manual verification that the "judge" judges correctly.

        Also, how exactly do you programmatically validate CVEs?
    • johnfn 8 hours ago
      Is this really that hard to parse?

      Curator and Finder are the names of the agents. "Answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "Shell steps" I presume means it gets to run 24 commands on the shell. "Structured report" - do I really need to explain to you what a report is? "Sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint at where the vulnerability lies.
    • peyton 10 hours ago
      > Did you use an LLM to generate this HN submission?

      Must have.

      > The Finder will never see the patch.

      I wasn’t worried that this eval would show the answer to the model before evaluating it. Seems requirements leaked into this post.
  • StrauXX 4 hours ago
    Do you plan on adding more models in the future? I would love to see how other OSS models like Gemma, GPT-OSS and Qwen fare.
  • mbbutler 11 hours ago
    It would be helpful to add in some cases that do not contain any vulnerabilities to assess false-positive rate as well.
    • mufeedvh 11 hours ago
      This is a good idea.

      Will incorporate false-positive rates into the rubric from the next run onwards.

      At winfunc, we spent a lot of research time taming these models to drive down false positives (the rate is high!), so this does feel important enough to be documented. Thanks!
    • cortesoft 10 hours ago
      Any code that can be certified as having no vulnerabilities at all is going to be pretty trivial to verify.
  • spicyusername 8 hours ago
    I'd love to see some of the open source models in there
    • PeterStuer 3 hours ago
      You mean like Kimi-K2.5 or GLM 5.1?
  • Rohinator 11 hours ago
    Very curious how Claude Mythos will perform here
  • withinboredom 2 hours ago
    I didn’t read tfa, but can we also have it be able to distinguish when a vulnerability doesn’t apply? As an open source contributor, people open nonsensical security issues all the time. It’s getting annoying.