Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

(senior-swe-bench.snorkel.ai)

17 points by matt_d 1 hour ago

6 comments

purple-leafy 43 minutes ago
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective. What you really need is an objective benchmark.
- eli 24 minutes ago
 I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
- echelon 30 minutes ago
 > What you really need is an objective benchmark
 "When are all the software engineers unemployed?"
 - purple-leafy 10 minutes ago
 Not sure I follow haha
jonathanleane 52 minutes ago
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
- lacunary 43 minutes ago
 Presumably whatever the top model scores and then some, since the human can use the model. I wonder if a model could score higher if it had a human at its disposal?
LiamPowell 29 minutes ago
> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
Madmallard 5 minutes ago
next round of trust me bro benchmarks
danpalmer 48 minutes ago
Why didn't they just make it "Staff SWE-Bench"? Would be much better smh. /s
But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer"; it's woo, and it's far better to ask for precise outcomes.
- amrrs 32 minutes ago
 As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real-life work, like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring these?
jocelyner 24 minutes ago
[flagged]