Show HN: We post-trained a model that pen tests instead of refusing

(argusred.com)

30 points by dk189 5 hours ago

5 comments

mkaszkowiak 0 minutes ago
What was your approach to benchmarking an adversarial agent?
This is an open problem that I came across (in a different domain), as the search space can be really wide. It's hard to measure results for non-trivial tasks.
Would be really interested if you could share your eval approach :)
cortesoft 30 minutes ago
> This won't be made available to anyone and everyone, but we do believe that responsible SMEs and midmarket companies also need access to these tools in order to identify key vulnerabilities in their systems; not just enterprises.
So this is the same policy that Anthropic and OpenAI have; it's just based on your criteria rather than theirs.
- dk189 13 minutes ago
  I think the policy makes sense universally: who would want to give a tool like this to bad actors? But it does leave a big section of the market underserved, particularly when Mythos was made accessible to very large orgs and then Fable was pulled on export grounds.
  - cyanydeez 8 minutes ago
    It's really absurd to think any of these models can be protected _by commercial interests_. They couldn't keep from hiring North Koreans any more than they'll stop bad actors from operationalizing these models.
- kennyadam 12 minutes ago
  As soon as I read that I literally scoffed. Doublethink at its finest. Doubleplusungood.
andai 32 minutes ago
Fantastic. Could you share more details about what it was like post-training a model?
Catloafdev 25 minutes ago
Why create an offensive tool rather than a repo-scanning tool?
I can't think of any way to safely release an offensive tool publicly.
- dk189 3 minutes ago
  You need both: scanning for your own code, and pen testing to actually prove vulnerabilities. Otherwise the results can be very noisy; one of the things most tools currently suffer from is giving you too many false positives. For the moment, we've gated the pen testing until we resolve the safety debate.