> the slow performance decays

The decays are just more capable models entering the population, making all prior models lose more frequently.
For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.
> we don't switch to heavily quantized models

That sounded like a press bulletin, so just to let you clarify: does that mean you may switch to *lightly* quantized models?
There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.

I am willing to bet large amounts of money that OpenAI would never release a model served as fully BF16 in the year of our lord 2026. That would be insane operationally. They're almost certainly doing QAT to FP4 for the FFN, and a similar or slightly larger quant for the attention tensors.
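Nobody outside the lab knows the actual recipe, but for anyone curious what weight quantization does in practice, here's a minimal QAT-style sketch, using symmetric int4 rounding as a stand-in for real FP4 formats like MXFP4 (the group size and all numbers are illustrative, not anyone's production config):

```python
import torch

def fake_quant_4bit(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Quantize->dequantize round trip (QAT-style "fake quant").

    Symmetric int4 rounding per group of weights, standing in for
    real FP4 formats like MXFP4. Group size is an arbitrary choice
    for this demo, not anyone's actual recipe.
    """
    groups = w.reshape(-1, group_size)
    # one scale per group: map the max magnitude onto the int4 range [-7, 7]
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (groups / scale).round().clamp(-7, 7)   # the "stored" 4-bit values
    return (q * scale).reshape(w.shape)         # dequantized for compute

w = torch.randn(4096, 4096)                     # stand-in for one FFN matrix
err = (w - fake_quant_4bit(w)).abs().mean()
print(err)  # the rounding error QAT teaches the model to absorb
```

Training with this round trip in the forward pass is what lets a lab ship the 4-bit weights without the quality cliff you'd get from quantizing after the fact.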
It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.
Thank you for your answer. I have a similar question as the OP, but in regard to the GPT models in MS Copilot. My experience is that the response quality is much better when calling the API directly or through the web UI.

I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter I'd be grateful, as I'm doing an analysis of which AI solutions might be suitable for my organisation.
It's very interesting to see that this only happens to American companies. What gives?
Seems like Chinese labs are the only ones that are trustworthy (at least when it comes to this specific issue). This feels so ironic, haha.
I am using Novita-hosted DeepSeek V4 (Flash) for work and the DeepSeek API for personal projects.

Novita's has occasional problems counting whitespace. DeepSeek-hosted does not.

No idea why.
there is something greatly trustworthy about open source
Neat. Would you add an option to normalize the Elo over time (e.g. update the model used as an anchor for the Elo computation) so the diff between labs is more visible?
The Elo rating system measures *relative* performance against the other models. As the other models improve, or rather as newer, better models enter the list, the Elo score of a given existing model will tend to decrease even if there are no changes whatsoever to the model or its system prompt.

You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
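For reference, the standard Elo update only ever looks at the two players in a single match, which is why a rating is only meaningful relative to the current pool (a minimal sketch):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b
```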
The relative and auto-scaling nature of Elo ranking feels like an advantage here.<p>Relative ranking systems extract more information per tournament. You will get something approximating the actual latent skill level with enough of them.
Advantage for what, exactly? I'm not saying Elo ranking doesn't give any information. It just doesn't give the information that the OP's project claims to give: that models get nerfed over time. You could extract that kind of information from the raw results of each evaluation round between two models, ignoring any new model entries, and compare these over time, but not from the resulting Elo scores with an ever-changing list of models.

New models are on average better than older models, so the average skill of the population of models increases over time, and you are mathematically guaranteed that any existing model will degrade in Elo score over time even if it didn't change in any way.

It's like benchmarking a model against a list of challenges that are made more and more difficult over time and then claiming the model got nerfed because its score declined.

Elo is good at establishing an overall ranking order across models, but that's not what this is about.

To detect nerfing of a model, projects like https://marginlab.ai/trackers/claude-code/ are much, much better (I'm not affiliated in any way).
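To see the guaranteed decline in isolation, here's a toy simulation: one model's true skill is held fixed while genuinely stronger entrants keep arriving at the default rating (every number here is invented):

```python
import random

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

random.seed(0)
K = 16
# the old model's true latent skill never changes
models = {"old_model": {"skill": 1500.0, "elo": 1500.0}}

for epoch in range(10):
    # each epoch a genuinely stronger model enters at the default Elo
    models[f"new_{epoch}"] = {"skill": 1500.0 + 40 * epoch, "elo": 1500.0}
    for _ in range(5000):
        a, b = random.sample(sorted(models), 2)
        # match outcome is driven by true skill, not current Elo
        p_a = expected(models[a]["skill"], models[b]["skill"])
        s_a = 1.0 if random.random() < p_a else 0.0
        e_a = expected(models[a]["elo"], models[b]["elo"])
        models[a]["elo"] += K * (s_a - e_a)
        models[b]["elo"] += K * ((1.0 - s_a) - (1.0 - e_a))
    print(epoch, round(models["old_model"]["elo"]))
```

The old model's rating declines epoch after epoch even though its win probability against any fixed opponent never changes.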
Is that strictly true? ELO rankings do also inflate over time (looking at you, Chess GMs)
Elo systems often include one or more ways new points can enter the system. The system used by the European Go Federation has three, iirc: 1. you cannot go under 100; 2. you cannot lose more than 100 points in one tournament; 3. a weaker player beating a stronger one gains more than the stronger player loses (which is partly countered by the stronger player beating the weaker one, but it's not balanced: if two people only play each other forever and ever, both of their Elos will grow).
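I don't remember the exact EGF formula, but mechanism 3 is easy to reproduce with any rating-dependent K factor where lower-rated players' ratings move faster than higher-rated players' (the curve and all numbers below are arbitrary):

```python
import random

def k_factor(rating: float) -> float:
    # higher-rated players move more slowly; this asymmetry is what
    # injects points, the exact curve is arbitrary for this demo
    return max(10.0, 40.0 - rating / 100.0)

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

random.seed(1)
a, b = 1000.0, 1000.0
for _ in range(100_000):
    # two equally skilled players: each wins half the time
    s_a = 1.0 if random.random() < 0.5 else 0.0
    e_a = expected(a, b)
    a += k_factor(a) * (s_a - e_a)
    b += k_factor(b) * ((1.0 - s_a) - (1.0 - e_a))
print(round(a), round(b))  # both end well above the starting 1000
```

Because the lower-rated player gains more from a win than the higher-rated player loses, every upset injects points into the system, and two equal players trading wins ratchet each other upward.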
The interesting thing I find is how Anthropic has been improving more consistently over the last few years, which allowed it to catch up with and surpass OpenAI and Google. The latter two have pretty much plateaued over the last year or so. GPT 5.5 is somehow not moving the needle at all.

I hope the other labs can bring back competition soon!
Although Arena is adversarial and resistant to goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthfulness.
FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.
Electric Light Orchestra anyone?
Unless you've just missed your last train to London.
You’re right:
https://en.wikipedia.org/wiki/Elo_rating_system
This is great, but personally I really wish we had an Elo leaderboard specifically for the quality of coding agents.

Honestly, in my opinion, GPT-5.5 Codex doesn't just crush Claude Code 4.7 Opus: it's writing code at a level so advanced that I sometimes struggle to even fully comprehend it. Even when navigating fairly massive codebases spanning four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming.

How would we even go about properly measuring and benchmarking Elo for autonomous agents like this?
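One plausible shape for it, sketched below: run two agents head-to-head on the same task, have a judge pick the better result, and feed outcomes into the standard Elo update. Everything here is hypothetical, the judge especially; in practice you'd want held-out test suites or blinded human review:

```python
import random
from typing import Callable, Dict

Agent = Callable[[str], str]          # task description -> proposed patch
Judge = Callable[[str, str], float]   # (patch_a, patch_b) -> 1.0 / 0.5 / 0.0

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def run_match(agents: Dict[str, Agent], ratings: Dict[str, float],
              task: str, judge: Judge, k: float = 32.0) -> None:
    """Pit two random agents against each other on one task and update Elo."""
    a, b = random.sample(sorted(agents), 2)
    # each agent attempts the same task independently (in practice,
    # in isolated checkouts of the same repo state)
    patch_a, patch_b = agents[a](task), agents[b](task)
    s_a = judge(patch_a, patch_b)  # 1.0 if A's result is better, 0.0 if B's
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (s_a - e_a)
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))
```

The hard part isn't the Elo bookkeeping, it's making the judge reliable and the tasks representative, which is exactly where the LM Arena approach gets strained for long-horizon agents.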
Very neat! It would be great to extend it to non-flagship models as well.
It seems to be a USA-only thing; Chinese models and Mistral don't show any downward trend.
Sure they do. Most models are on a downward trend because newer models are moving into top spots.
Wouldn't it be really weird if an open-weight model dropped in performance? The weights can't change, so the drop would have to come from the Elo ranking itself.
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then, it goes on to…provide an innocuous mapping of LM Arena scores that always go up?