I have two thoughts on the evaluation setup.<p>First, the Bradley-Terry tournament conflates how important a finding is with how much marginal value it adds. Detail finds an issue that three PR-review bots also flag at similar severity, but in production that issue earns no extra attention: the engineer was going to see it from the bot review anyway. It would be worth measuring what Detail finds that the bots, combined, miss, and tracking that as a separate metric (rough sketch at the end of this comment).<p>Second, using the same model as the judge introduces self-preference bias. When I have Claude review a PR and Codex review the same PR during development, Codex surfaces issues Claude tends to overlook. Sonnet 4.6 rating code produced by a system that itself runs on Sonnet carries some of that bias, even after the system summarizes the code.
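<p>To make that "separate metric" concrete, here is a rough sketch of a unique-find rate: the fraction of Detail's findings that none of the baseline bots also report. Everything below is hypothetical; the Finding fields, the matching rule (same file, nearby line, same normalized issue key), and the function names are my own assumptions, not anything from Detail's actual eval harness.

    # Hypothetical sketch of a "unique find rate" for Detail vs. a pool of
    # baseline review bots. Field names and the matching rule are assumptions.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Finding:
        file: str
        line: int
        issue_key: str  # normalized issue type, e.g. "null-deref"

    def same_issue(a: Finding, b: Finding, line_slack: int = 3) -> bool:
        # Treat two findings as the same issue if they hit the same file,
        # nearby lines, and the same normalized issue key.
        return (a.file == b.file
                and abs(a.line - b.line) <= line_slack
                and a.issue_key == b.issue_key)

    def unique_find_rate(detail, baseline_bots):
        # Fraction of Detail's findings that no baseline bot also reported.
        pooled = [f for bot in baseline_bots for f in bot]
        if not detail:
            return 0.0
        unique = [f for f in detail if not any(same_issue(f, g) for g in pooled)]
        return len(unique) / len(detail)

    # Toy example: one finding duplicated by a baseline bot, one unique.
    detail = [Finding("auth.py", 42, "null-deref"), Finding("db.py", 10, "race")]
    bots = [[Finding("auth.py", 43, "null-deref")], []]
    print(unique_find_rate(detail, bots))  # 0.5: only the db.py race is unique

<p>Something along these lines, computed over the same PRs as the Bradley-Terry tournament, would separate "finds real issues" from "finds issues nothing else already catches."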