Thanks for putting N-Day-Bench together - really interesting benchmark design and results.<p>I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps with adding it to the evaluation set.
Interesting. How fast is your service? Do you guarantee a certain number of tokens/s?
Nice. I've been thinking of doing something similar in our local jurisdiction (Australia).<p>Are you able to share (or point me toward) any high-level details: key hardware, hosting stack, rough economics, key challenges?<p>I'd love to offer to buy you a coffee but I won't be in Switzerland any time soon.
Definitely possible. In January, I tried using Gemini to perform black-box/white-box testing on an existing system at my company (it's quite old). It successfully exploited a hidden SQL injection vulnerability to penetrate the system and extract password hashes (not particularly strong passwords, successfully cracked on a public website).
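For the curious: the hole was the classic string-built query, roughly this shape (an illustrative sketch, not the actual code from our system):

```python
# Illustrative sketch of the vulnerable pattern (hypothetical code, not the real system):
import sqlite3

def get_user_insecure(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is concatenated straight into the SQL string,
    # so an input like  ' OR '1'='1  changes the query's meaning.
    query = f"SELECT id, password_hash FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # Fixed: parameterized query, the driver handles escaping.
    query = "SELECT id, password_hash FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```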
In terms of pure skill level, I'd say this is at least on par with a mid-level cybersecurity professional, and that's before even considering the significant efficiency improvement.
> Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.<p>Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?<p>It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.
Yeah, the LLM judge is a bit too gullible. GLM 5.1 here <a href="https://ndaybench.winfunc.com/traces/trace_585887808ff443cca3b5773edb4286f8">https://ndaybench.winfunc.com/traces/trace_585887808ff443cca...</a> claims that onnx/checker.cc doesn't reject hardlinks, even though it does (and the model output even quotes the lines that perform the check). The actual patch <a href="https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8fec71b69c74f786cb" rel="nofollow">https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8f...</a> instead adds a check using std::filesystem::weakly_canonical to catch path traversal through symlinks. It also adds a Python function that does the same (?) checks when saving files. Honestly, even that patch seems LLM-generated to me, the way it duplicates code in a bunch of places instead of channeling all file accesses through a single hardened function.<p>Anyway, GLM 5.1 gets a score of 93 for its incorrect report.
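To illustrate what I mean by a single hardened function: something like this (a rough Python sketch of the weakly_canonical idea, not what the patch actually does):

```python
# Rough sketch of "one hardened function" for path checks (Python analogue of
# std::filesystem::weakly_canonical; illustrative, not the actual onnx patch):
from pathlib import Path

def resolve_external_data_path(base_dir: str, relative_location: str) -> Path:
    base = Path(base_dir).resolve()
    # resolve() follows symlinks and normalizes ".." components, similar to
    # weakly_canonical; strict=False tolerates not-yet-existing files.
    candidate = (base / relative_location).resolve(strict=False)
    # Reject anything that escapes the model's directory via "..", symlinks, etc.
    if base != candidate and base not in candidate.parents:
        raise ValueError(f"external data path escapes base directory: {relative_location}")
    return candidate
```

Every load/save path would then go through that one helper instead of repeating the check at each call site.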
I worked in AppSec in the past, and it made sense to me. Maybe you aren't the target audience?<p>You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.
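e.g. something like this against the public NVD feed gets you most of the way (rough sketch; the endpoint and field names are from memory, so double-check them):

```python
# Rough sketch of programmatic validation against the public CVE record
# (NVD API 2.0 endpoint and JSON field names from memory -- verify before relying on them):
import requests

def fetch_cve_summary(cve_id: str) -> dict:
    resp = requests.get(
        "https://services.nvd.nist.gov/rest/json/cves/2.0",
        params={"cveId": cve_id},
        timeout=30,
    )
    resp.raise_for_status()
    vuln = resp.json()["vulnerabilities"][0]["cve"]
    return {
        "description": vuln["descriptions"][0]["value"],
        "cwe_ids": [
            d["value"]
            for w in vuln.get("weaknesses", [])
            for d in w["description"]
        ],
    }

def report_matches_advisory(report_cwe: str, cve_id: str) -> bool:
    # Cheapest check: does the model's claimed CWE match the advisory's CWE?
    return report_cwe in fetch_cve_summary(cve_id)["cwe_ids"]
```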
Is this really that hard to parse?<p>Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's the set of correct answers. "shell steps" I presume means it gets to run 24 commands in the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint pointing at the sink, i.e. roughly where the vulnerability lies in the code.
> Did you use an LLM to generate this HN submission?<p>Must have.<p>> The Finder will never see the patch.<p>I wasn’t worried that this eval would show the answer to the model before evaluating it. Seems requirements leaked into this post.
Do you plan on adding more models in the future? I would love to see how other OSS models like Gemma, GPT-OSS and Qwen fare.
It would be helpful to also add some cases that do not contain any vulnerabilities, to assess the false-positive rate as well.
This is a good idea.<p>Will incorporate false-positive rates into the rubric from the next run onwards.<p>At winfunc, we spent a lot of research time taming these models to drive down false positives (the rate is high!), so this does feel important enough to document. Thanks!
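Roughly, the plan is to mix in benign/patched cases and count flagged-but-clean cases as false positives; a sketch of the bookkeeping (field names are placeholders, not the actual rubric schema):

```python
# Sketch of false-positive-rate bookkeeping once benign cases are mixed in
# (field names are placeholders, not the actual rubric schema):
from dataclasses import dataclass

@dataclass
class CaseResult:
    has_vulnerability: bool   # ground truth: does the case contain the CVE?
    model_flagged: bool       # did the Finder report a vulnerability?

def false_positive_rate(results: list[CaseResult]) -> float:
    benign = [r for r in results if not r.has_vulnerability]
    if not benign:
        return 0.0
    false_positives = sum(r.model_flagged for r in benign)
    return false_positives / len(benign)
```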
Any code that you can be certain doesn't have any vulnerabilities is going to be pretty trivial to verify.
I'd love to see some of the open source models in there
Very curious how Claude Mythos will perform here
I didn’t read tfa, but can we also have it distinguish when a reported vulnerability doesn’t apply? As an open source contributor, I see people open nonsensical security issues all the time. It’s getting annoying.