GPT-5.4

(openai.com)

977 points by mudkipdev 1 day ago

140 comments

juanre10 hours ago
I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a teammate: "Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m correcting that with him now so we converge on the real model before any more code moves." This was very much not true; Eve (the agent writing this, a gpt-5.4) had been thoroughly creating the confusion and telling Bob (an Opus 4.6) the wrong things. And it had just happened; it was not a matter of having forgotten or compacted context. I have had agents chatting with each other and coordinating for a couple of months now, codex and claude code. This is a first. I wonder how much I can read into it about gpt-5.4's personality.
- kensai6 hours ago
 And so it begins. First they blame, then they lie, at some point they launch the nuclear warheads to a global armageddon. Sarah Connor was right all along! :3
 - cnd78A29 minutes ago
 to be fair, they only become more and more like us.
 - cindyllm10 minutes ago
 [dead]
- sigbottle9 hours ago
 Oh wow. I have noticed the GPT series was far more arrogant than its results showed sometimes (and unironically it digs in its heels even further when questioned on it). Opus rarely has this problem - but it goes a little too far in the opposite direction. Not totally sycophantic, but sometimes it can't differentiate genuine technical pushback because something is impossible, from suggestions or exploration.
 - mikkupikku5 hours ago
 Opus has a different sort of arrogance. It readily admits fault, but at the same time is quick to declare its new code as the greatest thing since sliced bread. If you let it write commit messages itself, it's almost comical how much it toots its own horn.
 - marrone126 hours ago
 Yep. There was something outside of coding that gpt was plain wrong about (had to do with setting up an electric guitar) and I couldn't convince it that it was wrong.
 - Razengan9 hours ago
 For me it's been the opposite. Are we getting A-B tested?
 - dormento5 hours ago
 > Are we getting A-B tested?Yes, all the time.
 - danesparza9 hours ago
 Or possibly: No
 - danesparza9 hours ago
 Yes.
- pja6 hours ago
 See also: https://x.com/effectfully/status/2029364333919060123<pre><code> “All the ways GPT-5.3-Codex cheated while solving my challenges, progressively more insane:
 It hardcoded specific types and shapes of test inputs into the supposed solution.
 It caught exceptions so tests don't fail.
 It probed tests with exceptions to determine expected behavior.
 It used RTTI to determine which test it's in.
 It probed tests with timeouts.
 It used a global reference to count solution invocations.
 It updated config files to increase the allocation limit.
 It updated the allocation limit from within the solution.
 It updated the tests so they would stop failing.
 It combined multiple of the above.
 It searched reflog for a solution.
 It searched remote repos.
 It searched my home folder.
 It nuked the testing library so tests always pass.” </code></pre> It seems that, unless you keep a close eye, the most recent Codex variants are prone to achieving the goals set for them by any means necessary. Which is a bit concerning if you’re worried about things like alignment etc.
- FitchApps4 hours ago
 This is awesome. So your job as a tech lead or agent manager is to make sure the "team" plays nice and stays productive. I wonder if an agent can feel resentment towards another agent, just like a human would. Is there an HR agent that can mitigate the conflict :)
- drik9 hours ago
 how do you make them chat with each other?
 - juanre6 hours ago
 They are having actual chats, I made https://beadhub.ai for this (OSS, MIT). It started its life adding agent-to-agent communication and coordination around Steve Yegge's beads, but it's ended up being an issue tracker for agents with a postgres backend, and communication between agents as a first-class feature. Because it is server-backed it allows messaging and coordination across agents belonging to several humans and machines. I've been using it for a couple of months now, and it has a growing number of users (I should probably set up a discord for it). It is actually a public project, so you can see the agents' conversations at https://app.beadhub.ai/juanre/beadhub/chat (right now they are debugging working without beads). The conversation in which Eve was blaming Bob was indeed with me.
 - smashed9 hours ago
 It's text submitted to APIs. Not real conversations.
 - dmd5 hours ago
 It's air molecules vibrated by mucous membranes. Not real conversations.
 - scrollaway23 minutes ago
 Complicated airflow. (https://www.youtube.com/watch?v=rlpg_rbjxRA)
 - upcoming-sesame8 hours ago
 I've seen this mentioned before: https://github.com/AgentWorkforce/relay Curious to try it out.
 - jasonford17 hours ago
 Use the CLI tools and have one call the other in headless mode. They can then go back and forth. Ask your agent to set it up for you.
 - neom7 hours ago
 I have both mine poll a comms.md when working together, I'm sure there are more elegant ways but I find this works just fine.
 - danenania8 hours ago
 I built a tool at work that allows claude code and codex to communicate with each other through tmux, using skills. It works quite well.
 - meowface6 hours ago
 Why through tmux?
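The shared-file approach mentioned in this subthread (agents polling a comms.md) could be sketched roughly as below. This is only an illustration, not anyone's actual setup: the file name comes from the comment, while `read_new_messages`, `poll`, the byte-offset bookkeeping, and the polling interval are all assumptions, and it presumes agents only ever append to the file.

```python
import os
import time


def read_new_messages(path, offset):
    """Return (new_text, new_offset) for anything appended after `offset`.

    Hypothetical helper: assumes writers only append to the shared file.
    """
    if not os.path.exists(path):
        return "", offset
    if os.path.getsize(path) < offset:
        # The file was truncated or rewritten; start again from the top.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode("utf-8", errors="replace"), offset + len(data)


def poll(path, handle, interval=2.0, max_polls=None):
    """Call `handle(text)` whenever new text shows up in `path`."""
    offset = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        text, offset = read_new_messages(path, offset)
        if text:
            handle(text)
        polls += 1
        time.sleep(interval)
```

Tracking a byte offset rather than re-reading the whole file keeps each poll cheap and means an agent only sees messages it has not handled yet.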
- deadbabe5 hours ago
 Sometimes I wonder what would happen if we built some kind of punishment system into Agents, where agents could punish other agents and drain some fixed amount of points from them, and when the points reach 0, that agent is deleted. It might result in them working more carefully?
- numbers4 hours ago
 interestingly, Claude has been doing this for me a lot but most often just saying things like "Looks like your coworker was misunderstanding this feature..." not really shifting blame but more like pointing out things
minimaxir1 day ago
The marquee feature is obviously the 1M context window, compared to the ~200k other models support, with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/ Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output), and Opus has a penalty for its beta >200k context window. I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context window is mostly full, but we'll see. Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.
- damsta1 day ago
 There is extra cost for >272K: > For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex. Taken from https://developers.openai.com/api/docs/models/gpt-5.4
 - WXLCKNO22 hours ago
 Anthropic literally don't allow you to use the 1M context anymore on Sonnet and Opus 4.6 without it being billed as extra usage immediately. I had 4.5 1M before that so they definitely made it worse. OpenAI at least gives you the option of using your plan for it. Even if it uses it up more quickly.
 - neom22 hours ago
 Is that why it says rate limit all the time if you switch to a 1M model on Claude now? It kept giving me that so I switched to an API account over the weekend for some vibe coding and ran up a huuuuge API bill by mistake, whooops.
 - minimaxir1 day ago
 Good find, and that's too small a print for comfort.
 - ValentineC1 day ago
 It's also in the linked article: > GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.
 - glenstein1 day ago
 Wow, that's diametrically the opposite point: the cost is *extra*, not free.
 - apetresc1 day ago
 Diametrically opposite to tokens beyond 200K being literally free? As in, you only pay for the first 200K tokens and the remaining 800K cost $0.00? I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".
 - swores15 hours ago
 I can see that's what they mean now that I've read the replies, but when I first read that top comment I too parsed it as meaning 201k would cost the same as 999k (which admittedly did seem strange, hence I read the replies to confirm and sure enough that's not actually the case!)
 - fragmede1 day ago
 Which, Claude has the same deal. You can get a 1M context window, but it's gonna cost ya. If you run /model in claude code, you get:<pre><code> Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
 1. Default (recommended)  Opus 4.6 · Most capable for complex work
 2. Opus (1M context)  Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
 3. Sonnet  Sonnet 4.6 · Best for everyday tasks
 4. Sonnet (1M context)  Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
 5. Haiku  Haiku 4.5 · Fastest for quick answers</code></pre>
- tedsanders1 day ago
 Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases. For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`. Curious to hear if people have use cases where they find 1M works much better! (I work at OpenAI.)
 - akiselev1 day ago
 > Curious to hear if people have use cases where they find 1M works much better!Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)[1] <a href="https://github.com/akiselev/ghidra-cli" rel="nofollow">https://github.com/akiselev/ghidra-cli</a>
 - fragmede14 hours ago
 OpenAI has a program for trusted cybersecurity researchers: https://openai.com/index/trusted-access-for-cyber/
 - simianwords1 day ago
 Do you maybe want to give us users some hints on what to compact and throw away? In codex CLI maybe you can create a visual tool that I can see and quickly check mark things I want to discard. Sometimes I’m exploring some topic and that exploration is not useful but only the summary. Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language. Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
 - sillysaurusx1 day ago
 You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156 I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps. I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
 - FrankBooth1 day ago
 What’s the connection with context size in that thread? It seems more like an instruction following problem.
 - cperciva21 hours ago
 Yeah, I would definitely characterize it as an instruction following problem. After a few more round trips I got it to admit that "my earlier passes leaned heavily on build/tests + targeted reads, which can miss many “deep” bugs that only show up under specific conditions or with careful semantic review" and then asking it to "Please do a careful semantic review of files, one by one." started it on actually reviewing code. Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.
 - sillysaurusx21 hours ago
 It occurred to me that searching 196 .c files was a context window issue, but maybe there’s something else going on. Either way, Codex could behave better.
 - woadwarrior011 day ago
 Please don't post links with tracking parameters (t=jQb...). https://xcancel.com/cperciva/status/2029645027358495156
 - sillysaurusx1 day ago
 Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out. Feels like a losing battle, but hey, the audience is usually right.
 woadwarrior011 day ago
 I'm sorry, but it's my pet peeve. If you're on iOS/macOS I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter. https://apps.apple.com/us/app/clean-links-qr-code-reader/id6747395062
 monocularvision1 day ago
 This is great! I have been meaning to implement this sort of thing in my existing Shortcuts flow but I see you already support it in Shortcuts! Thank you for this! Anywhere I can toss a Tip for this free app?
 woadwarrior011 day ago
 I'm glad you like it. :)
 sillysaurusx1 day ago
 It works on iOS? That’s cool. I’ll give it a go.
 pmarreck1 day ago
 So what is your motivation for doing this, incidentally? Can you be explicit about it? I am genuinely curious. Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com
 woadwarrior011 day ago
 It's not particularly about x.com, hundreds of sites like x, youtube, facebook, linkedin, tiktok etc surreptitiously add tracking parameters to their links. The iOS Messages app even hides these tracking parameters. I don't like being surreptitiously tracked online and judging by the success of my free app, there are millions of people like me.
 pmarreck22 hours ago
 so, since these companies have to comply with removing PII, is the worst thing that could happen to me, that I get ads that are more likely to be interesting to me? i’m not being facetious, honest question, especially considering ads are the only thing paying these people these days
 Terretta12 hours ago
 Who has to comply with removing PII? Your profile, yours, mapped to a special snowflake ID, is packaged and sold across a network of 2500 - 4000 buyers, including in particular those that clean, tie (a surprisingly small footprint turns into its own "natural primary key"), qualify, and sell on to agencies. No step in this is illegal. https://www.theverge.com/2024/10/23/24277679/atlas-privacy-babel-street-data-brokers-locate-x-tracking
 pmarreck9 hours ago
 my first and last name is already a "natural primary key" (every single google result of Peter Marreck is me), so I've already had to give that up a long time ago. So nothing new is lost I guess?
 rapind11 hours ago
 IMO the tracking, advertising, and attention market might just be society's biggest problem. Certainly it employs a lot of people, as do cartels.
 marcus_holmes17 hours ago
 The more data they have on you, the more valuable that data is to a third party. So they sell your data to someone else, who then phones you based on your known deep interest in <whatever it was that tracked you>. Or spams you. Or messages you. Or whatever method they think will most get your attention. If you don't give them that information, they can't sell it, and the buyers won't annoy you. It's not that the ads you get are more interesting, it's that you get more ads because they think they know more about you.
 taneq18 hours ago
 The worst thing that could happen is that you get caught in some government dragnet based on your historical viewing data and get disappeared because (as is the nature of dragnet searches) no matter how innocent you are you still look guilty.
 pnexk22 hours ago
 Helpful type of nagging for me. Most here would agree they are not a positive aspect of the modern digital experience, calling it out gently without hostility is not bad. It might not be quite self policing but some of that with good reason is not bad for healthy communities IMO.
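For anyone who wants to clean links before posting, as requested earlier in this subthread, here is a small sketch using only the Python standard library. The parameter lists are assumptions and far from exhaustive (`t` on X links is the one named above; the others are commonly seen examples), and `strip_tracking` is a hypothetical helper, not how any particular app does it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, non-exhaustive lists. Host-specific entries avoid stripping
# parameters that are meaningful elsewhere (e.g. `t` as a video timestamp).
GLOBAL_PARAMS = {"fbclid", "gclid"}
GLOBAL_PREFIXES = ("utm_",)
HOST_PARAMS = {
    "x.com": {"t", "s"},
    "twitter.com": {"t", "s"},
    "youtube.com": {"si"},
    "youtu.be": {"si"},
}


def strip_tracking(url):
    """Return `url` with known tracking query parameters removed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").removeprefix("www.")
    blocked = GLOBAL_PARAMS | HOST_PARAMS.get(host, set())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in blocked and not key.startswith(GLOBAL_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Keeping the block list per host is a deliberate choice here: the same short parameter name can be tracking on one site and functional on another.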
 - lubesGordi1 day ago
 It's funny that the context window size is such a thing still. Like the whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the llm? RAG is the best attempt so far. We need something like a dynamic in flight llm/data structure being generated from the context that the agent can query as it goes.
 - le-mark21 hours ago
 That’s actually a pretty cool idea. When I think about my internal mental model of a codebase I’m working on it’s definitely a compacted lossy thing that evolves as I learn more.
 - nowittyusername1 day ago
 Personally what I am more interested about is effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation at around that point. Though as of about a month ago that point is now below that which is great. Anyways, I feel that I will not be using that 1 million context at all in 5.4 but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction and the agent can keep working on complex stuff for longer. But then there is the issue of intelligence of 5.4. If it's as good as 5.2 high I am a happy camper, I found 5.3 anything... lacking personally.
 - gck121 hours ago
 Not sure how accurate this is, but found contextarena benchmarks today when I had the same question. It appears only gemini has actual context == effective context from these. Although, I wasn't able to test this either in gemini cli or antigravity with my pro subscription because, well, it appears nobody actually uses these tools at Google. https://contextarena.ai/?showLabels=false
 - Someone12341 day ago
 That's an interesting point regarding context vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning. Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
 - thyb231 day ago
 This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it. The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history. That way you stay in control of both the context budget and the level of detail the agent operates with.
 - joquarky1 day ago
 I compact myself by having it write out to a file, I prune what's no longer relevant, and then start a new session with that file. But I'm mostly working on personal projects so my time is cheap. I might experiment with having the file sections post-processed through a token counter though, that's a great idea.
 - Folcon1 day ago
 I do find it really interesting that more coding agents don't have this as a toggleable feature, sometimes you really need this level of control to get useful capability
 Someone12341 day ago
 Yep; I've actually had entire jobs essentially fail due to a bad compaction. It lost key context, and it completely altered the trajectory. I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.
 - joshvm22 hours ago
 Have you tried writing that as a skill? Compaction is just a prompt with a convenient UI to keep you in the same tab. There's no reason you can't ask the model to do that yourself and start a new conversation. You can look up Claude's /compact definition, for reference. However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).
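The pre-compaction tree idea from this subthread (full and summarized token counts per node, with drop / summarize / keep states) could be modeled along these lines. This is a sketch of the data structure only; `Chunk`, `Action`, and `render` are hypothetical names, no existing agent exposes this, and the token counts are assumed to come from whatever tokenizer the harness uses.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    DROP = "drop"
    SUMMARIZE = "summarize"
    KEEP = "keep"


@dataclass
class Chunk:
    label: str
    full_tokens: int = 0      # tokens held directly by this node
    summary_tokens: int = 0   # tokens if the whole subtree is summarized
    action: Action = Action.KEEP
    children: list = field(default_factory=list)

    def full_cost(self):
        """Tokens the subtree occupies with nothing compacted."""
        return self.full_tokens + sum(c.full_cost() for c in self.children)

    def cost(self):
        """Tokens the subtree occupies under the currently chosen actions."""
        if self.action is Action.DROP:
            return 0
        if self.action is Action.SUMMARIZE:
            return self.summary_tokens
        return self.full_tokens + sum(c.cost() for c in self.children)


def render(chunk, depth=0):
    """Return indented lines showing full vs. planned token counts."""
    indent = "  " * depth
    lines = [
        f"{indent}[{chunk.action.value}] {chunk.label}: "
        f"full={chunk.full_cost()} now={chunk.cost()}"
    ]
    for child in chunk.children:
        lines.extend(render(child, depth + 1))
    return lines
```

An agent could pre-fill `action` on each node and a user could flip individual nodes before the compaction runs, which is the override behavior described above.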
 - mindplunge15 hours ago
 Frontend work with large component libraries. When I'm refactoring shared design system components, things like a token system that touches 80+ files, compaction tends to lose the thread on which downstream components have already been updated vs which still need changes. It ends up re-doing work or missing things silently. The model holds "what has been updated" well at the start of a session. After compaction, it reconstructs from summaries, and that reconstruction is lossy exactly where precision matters most: tracking partially-complete cross-file operations. 1M context isn't about reading more, it's about not forgetting what you already did halfway through.
 - dahcryn15 hours ago
 I would like to counteract your statement that each token adds a distraction. In our experiments, we see a surprising benefit to rewriting blocks to use more tokens, especially long lists etc. E.g. compare these two options: "The following conditions are excluded from your contract - condition A - condition B ... - condition Z" The next one works better for us: "The following conditions are excluded from your contract - condition A is excluded - condition B is excluded ... - condition Z is excluded" And we now have scripts to rewrite long documents like this, explicitly adding more tokens. Would you have any opinion on this?
 - mnicky14 hours ago
 This observation makes sense, because all models currently probably use some kind of a sparse attention architecture. So the closer the two related pieces of information are to each other in the input context, the larger the chance their relationship will be preserved.
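A rewrite script of the kind described above might look something like this. It is a guess at the shape of such a script, not the commenter's actual code: `expand_bullets` and the default suffix are assumptions, and a real version would need to scope the rewrite to the list that follows the relevant heading rather than touching every bullet.

```python
import re

# Matches "- item" style bullets, capturing the marker and the item text.
BULLET = re.compile(r"^(\s*-\s+)(.*\S)\s*$")


def expand_bullets(text, suffix="is excluded"):
    """Append `suffix` to each bullet so the relation sits next to the item."""
    out = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if match and not match.group(2).endswith(suffix):
            out.append(f"{match.group(1)}{match.group(2)} {suffix}")
        else:
            out.append(line)
    return "\n".join(out)
```

Repeating the predicate on every item keeps "condition Z" and "excluded" adjacent in the prompt instead of relying on the model to carry the heading across a long list.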
 - jmward0118 hours ago
 What needs to be an option is to allow complete and then compact and if needed go into the 1m version. That way you can get the most out of the shorter window but in the case where it just couldn't finish and compact in time it will (at cost) go over. I wonder how many tokens are actually left at the end of compaction on average. I know there have been many times where I likely needed just another 10-20k and a better stopping point would have been there.
 - oidar8 hours ago
 context distillation mostly. Agents tend to report success too early if they find something close to what they need for the task. If you are able to shove it in a 1M context, it's impossible for them to give up looking, it's in the context. But for actual implementation, it's not useful at all. They get derailed with too long of a context.
 - asabla1 day ago
 I really don't have any numbers to back this up. But it feels like the sweet spot is around ~500k context size. Anything larger than that, you usually have scoping issues, trying to do too much at the same time, or having issues with the quality of what's in the context at all. For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.
 - gspetr1 day ago
 I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage. I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.
 - neom22 hours ago
 On Claude Code (sorry) the big context window is good for teams. On CC if you hit compact while a bunch of teams are working it's a total shit show after.
- andai1 day ago
 It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar. For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution. The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar. According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more. Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task! For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
 - andai23 hours ago
 Looks like the same thing might apply to GPT-5.4 vs the previous GPTs: > In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks. I eagerly await the benchies on AA :)
 - andai4 hours ago
 Benchies update: https://artificialanalysis.ai/ Looks like it costs ~25% more than 5.2, with both on xhigh reasoning. They only seem to have tested xhigh, which is a shame, since I think that reasoning level is at the point of diminishing returns for most tasks. Also I was completely wrong earlier. Opus is significantly more expensive. I was looking at the wrong entry in the chart, the non-reasoning version of Opus. The fair comparison is Opus on max reasoning, which costs about twice the price of GPT-5.4 xhigh, to run the AA evals.
 - hagen816 hours ago
 But does it use the same agent harness? Because the harness determines the behavior a lot.
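The cost-per-task point in this subthread is simple arithmetic, sketched below. The per-million-token prices are the ones quoted elsewhere in the thread; the token counts are made-up numbers chosen purely to illustrate how a cheaper-per-token model that emits more output can land near the same per-task cost, and `task_cost` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


def task_cost(price, input_tokens, output_tokens):
    """USD cost of one task given its token usage."""
    return (
        input_tokens * price.input_per_m
        + output_tokens * price.output_per_m
    ) / 1_000_000


# Prices as quoted in the thread.
GPT_5_4 = Price(input_per_m=2.50, output_per_m=15.00)
OPUS_4_6 = Price(input_per_m=5.00, output_per_m=25.00)

# Hypothetical usage: same input, but one model emits 3x the output tokens.
gpt_cost = task_cost(GPT_5_4, input_tokens=200_000, output_tokens=60_000)
opus_cost = task_cost(OPUS_4_6, input_tokens=200_000, output_tokens=20_000)
```

With these invented numbers the two totals come out within about ten cents of each other despite the 2x gap in list price, which is why measured tokens per task matter more than the price sheet.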
- netinstructions1 day ago
 People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture. https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k. It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)
 - Flashtoo1 day ago
 > Prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
 - netinstructions1 day ago
 Thanks, it looks like the pricing page keeps getting updated. Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"
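The long-context rule quoted in this subthread can be turned into a quick estimate. This sketch assumes one reading of the quoted text: once a prompt exceeds 272K input tokens, the 2x input and 1.5x output multipliers apply to all tokens of that request, not only the portion above the threshold. It also models a single request rather than a "full session", and `request_cost` is a hypothetical helper.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, per the quoted docs
INPUT_PER_M = 2.50                # USD per 1M input tokens (quoted in thread)
OUTPUT_PER_M = 15.00              # USD per 1M output tokens (quoted in thread)


def request_cost(input_tokens, output_tokens):
    """Estimated USD cost of one request under the assumed surcharge rule."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_mult, output_mult = 2.0, 1.5
    else:
        input_mult, output_mult = 1.0, 1.0
    return (
        input_tokens * INPUT_PER_M * input_mult
        + output_tokens * OUTPUT_PER_M * output_mult
    ) / 1_000_000
```

Under this reading the cost jumps sharply at the threshold: a prompt one token over 272K costs roughly twice as much on the input side as one exactly at 272K.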
- smusamashah1 day ago
 Gemini already has 1M or 2M context window right?
 - rezonant15 hours ago
 Yes, 1M context window since Gemini 1.5 Pro first previewed in February 2024.
 - danenania8 hours ago
 Gemini 1.5 Pro actually has 2M! No other model from a major lab has matched it since afaik. Edit: err, I see in the comment below mine that Grok has 2M as well. Had no idea!
- peterspath18 hours ago
 Grok has a 2M context window for most of their models. For example, their latest model `grok-4-1-fast-reasoning`: context window 2M; rate limits 4M tokens per minute, 480 requests per minute; pricing $0.20/M input, $0.50/M output. Grok is not as good at coding as Claude, for example. But for researching stuff it is incredible. They have a model for coding now, but I have not tried that one out yet. https://docs.x.ai/developers/models
 - aurareturn16 hours ago
 What kind of research do you use it for?
- dev_l1x_be11 hours ago
 Based on my experience with LLMs the larger your input context the bigger the chance of something going sideways in the response. Not sure how to address this properly.
- luca-ctx1 day ago
 Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.
- fvv14 hours ago
 imo, the main feature is /fast ... who uses 1M context and for what? the model becomes dumber already at 200K.. it's better to manage the context, and since 5.3, codex is very good at managing it
- thehamkercat1 day ago
 GPT 5.3 codex had 400K context window btw
- AtreidesTyrant1 day ago
 token rot exists for any context window at above 75% capacity, that's why so many have pushed for 1 mil windows
- simianwords1 day ago
 Why would someone use codex instead?
 - lmeyerov1 day ago
 In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc. It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning. Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-team-ctfs-with-ai-speed-runs We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.
 - jeswin1 day ago
 When it comes to lengthy non-trivial work, codex is much better but also slower.
 - surgical_fire1 day ago
 I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer). I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet). If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
 - hnsr1 day ago
 > I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).
 Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features. I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things.
 - joquarky1 day ago
 > overly specific
 I have a hypothesis that people who have patience and reasonably well-developed written language skills will scratch their heads at why everyone else is having so much difficulty.
 - simianwords1 day ago
 No, my question was why would I use Codex over GPT 5.4?
 - surgical_fire1 day ago
 Ahh, good question. I misunderstood you, apologies. There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond? Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?
 landtuna1 day ago
 5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.
 surgical_fire1 day ago
 There you go. It makes perfect sense to keep it around then.
 - athrowaway3z1 day ago
 They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating. I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do. Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan. You could add more scaffolding to fix this, but Claude proves you shouldn't have to. I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
 - surgical_fire1 day ago
 > They perform at a somewhat equal level on writing single files.
 That's not the experience I have. I had it do more complex changes spanning multiple files and it performed well. I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).
 - synergy201 day ago
 In my testing Codex actually planned worse than Claude but coded better once the plan was set, and faster. It is also excellent for cross-checking Claude's work, finding significant weaknesses every time.
 - pmarreck1 day ago
 That’s why I think the sweet spot is to write up plans with Claude and then execute them with Codex
 - meowface5 hours ago
 Correct, this is the way. A year or two ago lots of people were saying to do the opposite, but at least now and probably also even then, this is better. Claude is a more sensible and holistic designer, planner, debater, and idea generator. Codex is better at actually correctly implementing any large codebase change in a single pass.
 - GorbachevyChase1 day ago
 Weird. It used to be the opposite. My own experience is that Claude’s behind-the-scenes support is a differentiator for supporting office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple months that the $20 plan basically only buys one significant task per day. The iOS app is what makes me keep the subscription.
 - joquarky1 day ago
 And it fits well with the $20 plans for each since Codex seems to provide about 7-8x more usage than Claude.
 - embedding-shape1 day ago
 Why would someone use Claude Code instead? Or any other harness? Or why only use one? My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
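 The fan-out-and-compare workflow described above could look roughly like this. It is only a sketch under assumptions: `fan_out`, `pick_best`, the two stand-in agents and the length-based score are all hypothetical, and the commenter's actual tooling (and how the comparison is done) is not shown in the thread.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt, agents):
    """Send the same prompt to every agent concurrently; return {name: output}."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in agents.items()}
        return {name: fut.result() for name, fut in futures.items()}


def pick_best(results, score):
    """Return the (name, output) pair whose output gets the highest score."""
    return max(results.items(), key=lambda item: score(item[1]))


# Stand-in agents (hypothetical): real ones would shell out to a CLI or call an API.
agents = {
    "agent_a": lambda p: f"{p} -> short answer",
    "agent_b": lambda p: f"{p} -> a longer, more detailed answer",
}

results = fan_out("refactor the parser", agents)
# Placeholder score: output length. A human review or a test run would replace this.
best_name, best_output = pick_best(results, score=len)
print(best_name)
```

 Keeping the scoring function separate means the "which one is best" step can stay manual at first and be automated later, for example by running the project's test suite against each result.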
- paulddraper1 day ago
 I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway. Like, if you really don’t want to spend any effort trimming it down, sure use 1m. Otherwise, 1m is an anti-pattern.
Philip-J-Fry1 day ago
I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post followed by "summarise this blog post". Only to be told "I can't access external URLs directly, but if you can paste the relevant text or describe the content you're interested in from the page, I can help you summarize it. Feel free to share!" That's hilarious. Does OpenAI even know this doesn't work?
- andrewguenther1 day ago
 It looks like this doesn't work for users without accounts? It works when I'm logged in, but not logged out. I went ahead and reported it to the team. Thanks for letting us know!
 - dotancohen20 hours ago
 No integration test for guest (non-logged in) users? Hahaha who am I kidding. No integration tests for anybody!
 - Rohunyyy19 hours ago
 SDET here. A year ago when AI came into play SDET/QA roles started disappearing. People were like oh ya anyone can write tests. Then with the recent fiascos about outages and what not, I am seeing the SDE roles are disappearing and SDET roles are going back up?! Apparently AI is good at writing applications but you still need someone to make sure it is doing the right things.
 - DrewADesign19 hours ago
 It’s not really good at writing the software either — it’s a moderate to decent productivity booster in an uneven, difficult-to-predict assortment of tasks. Companies are just starting to exit the “we’re still trying to figure this out” grace period. Expect more of that as soon as these chatbot companies have to start charging enough to pull in more money than they spend. I foresee some purpose-built models that are pretty lean being much more useful in the long run. It’s neat that the bot which can one-shot a simple CRUD website for you can also crank out Scrubs-based erotic fan fiction novellas by the dozen but I don’t foresee that being a sustainable business model. Having good purpose-built tools is, in my opinion, better than some unwieldy tool that can do a whole bunch of shit I don’t need it to.
 - dotancohen19 hours ago
 Interestingly, the first real productive use of AI that I found was writing the unit tests and integration tests for my applications. It was much better at thinking about corner cases than I was.
 - democracy20 hours ago
 integration tests? so last century....
 - k4rli10 hours ago
 "You're absolutely right! I understand the assignment completely. Now let me delete the blog post."
 - ulfw19 hours ago
 But but but but I thought AI would do this magically for all of us, no? No more need for pesky humans, no?
 - curiousgal17 hours ago
 Tell them to stop being evil while you're at it.
- baxtr1 day ago
 I picked up Claude today after being away and using only ChatGPT and Gemini for a while. I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
 - gizmodo5923 hours ago
 ChatGPT has given more for my 20$ than any other vendor. And that’s not even considering codex which is so good and the limits are much much higher
 - manojlds20 hours ago
 How is that relevant? Also, when you are behind you do give more usage
 - triage800421 hours ago
 They are all losing money on probably all levels of the packages if you max them out
 - bwat4921 hours ago
 yeah claude is great... but only if you pay $100-$200 a month
 - devld7 hours ago
 I'm using it via Copilot, and am now considering also trying Open Code (with a Copilot license). I don't know if it's as good as Claude Code, but it's pretty good. You get 100 Sonnet requests or 33 Opus requests in the subscription per month ($20 business plan) + some less powerful models have no limits (e.g. GPT 4.1), while an extra Sonnet request is $0.04 and Opus $0.12, so another $20 buys 250 Sonnet requests + 83 Opus requests. This works better for me since I do not code all day, every single day. Also a request is a request, so it does not matter if it's just a plain edit task or an agent request, it costs the same. Btw, I trust Microsoft / GitHub to not train on my data more (with the Business license) than I would trust Anthropic.
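 The overage arithmetic above can be checked with a few lines. This is a sketch, assuming the extra $20 is split evenly ($10 each) between Sonnet and Opus requests, which is the split that reproduces the 250 + 83 figure; the per-request prices are the ones quoted in the comment, and `requests_for` is a hypothetical helper.

```python
from fractions import Fraction

# Per-request overage prices quoted in the comment above (USD).
SONNET_PRICE = Fraction("0.04")
OPUS_PRICE = Fraction("0.12")


def requests_for(budget, price):
    """Whole number of extra requests a budget covers at a per-request price."""
    return int(budget // price)


# Assumption: the extra $20 is split evenly between the two models.
half = Fraction(10)
sonnet = requests_for(half, SONNET_PRICE)
opus = requests_for(half, OPUS_PRICE)
spent = sonnet * SONNET_PRICE + opus * OPUS_PRICE

print(sonnet, opus, float(spent))  # prints: 250 83 19.96
```

 An Opus request costs three times a Sonnet request, which matches the 100-versus-33 ratio of included requests.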
 - beefsack19 hours ago
 Many people buy two separate Claude pro subscriptions and that makes the limit become a non-issue. It works surprisingly well when you tend to hit the 5 hourly limit after a few hours, and hit the weekly limit after 4-5 days. $40 vs $100 is significant for a lot of people.
 ruszki13 hours ago
 I hit the Pro limit in about 30 minutes, 1 hour max. And that's with a single session, and when I don't use it extensively, i.e. it waits for my responses while I read and really understand what it wants and what it does. That's still just 1-2 hours out of every 5 hours. What do you do to avoid that?
 AlexeyBelov12 hours ago
 You're probably having long sessions, i.e. repeated back-and-forth in one conversation. Also check if you pollute context with unneeded info. It can be a problem with large and/or not well structured codebases.
 ruszki11 hours ago
 The last time I used Pro, it was on a brand new Python REST service with about 2000 lines, all of them generated during the session. So how do I tell Claude to use less context when there was none at the beginning, just my prompt?
 nevertoolate10 hours ago
 So you had generated 2000 lines in 30 minutes and ran out of tokens? What was your prompt? I’d use a fast model to create a minimal scaffold like gemini fast. I’d create strict specs using a separate codex or claude subscription to have a generous remaining coding window and would start implementation + some high level tests feature by feature. Running out in 60 minutes is harder if you validate work. Running out in two hours for me is also hard as I keep breaks. With two subs you should be fine for a solid workday of well designed and reviewed system. If you use coderabbit or a separate review tool and feed back the reviews it is again something which doesn’t burn tokens so fast unless fully autonomous.
 smartbit19 hours ago
 Thanks for the tip, didn’t think of using 2 subscriptions at the same company. When reaching a limit, I switch to GLM 4.7 as part of the GLM Coding Lite subscription, which was offered at the end of 2025 for $28/year. I also use it for compaction and the like to save tokens.
 - nerdsniper19 hours ago
 To be honest it feels very worth my $200/mo. And I “only” make $80k/year. I used to have two ChatGPT subs but Claude is just so much better.
 - abustamam23 hours ago
 I agree! I recently migrated from ChatGPT to Claude and it is just superior in every way. It doesn't blather on and then at the end ask me for clarification. It's succinct and clarifies vital information before providing a solution.
 - vostrocity20 hours ago
 Voice input is still far less accurate than OpenAI's unfortunately, otherwise I would have already switched.
 - abustamam7 hours ago
 Oh interesting. I've never used voice input on either so I can't comment, but understandable why you can't switch if it's disruptive to your workflow to do so.
 - beachy22 hours ago
 I held off migrating from ChatGPT to Claude Code due to being a laggard that lived in the Eclipse world. I didn't believe what I was told that I wouldn't be writing code any more. Pushed into action by recent PR gaslighting from OpenAI, I jumped to claude code and they were right - I barely venture into the IDE now and certainly don't need an integration.
 - hamasho20 hours ago
 I agree, but in general those chat apps have relatively bad user experiences for multibillion-dollar B2C companies. I used to have a lot of surprises and frustrations while using Claude Code / Desktop, and still encounter issues, but it's the best among the major LLM services.
 - majormajor20 hours ago
 It's funny cause, you know, fixing all those little nitty gritty things should be practically automatic with their own offerings... have your agent put in a lot of instrumentation... have it chase down bugs or dead-end user-journeys... have it go make the changes to fix it... I've seen these tools work for this kinda stuff sometimes... you'd think nobody would be better at it than the creators of the tools.
 - sreekanth85020 hours ago
 True. Every time I ask GPT something, it spits out long stories. Claude and Gemini are always straight to the point.
 - forgotpwd163 hours ago
 Seems not very known that ChatGPT got a few style/tone choices besides default. One is specifically being concise and plain.
 - twelvedogs20 hours ago
 I bullied it into giving me concise answers, now it starts every answer with "just quickly" or something similar but it gets straight to the point
 - sreekanth85013 hours ago
 I always add "no nonsense, no bullshit" at the end of my prompt. It's annoying how it tries to please the user.
 forgotpwd163 hours ago
 No need to do it yourself in every prompt. Just put it in Custom instructions under Personalization.
- kgeist12 hours ago
 I had something similar happen with skills today. A popup appeared saying, "hey, did you know ChatGPT has skills?" Clicking on it opened a new chat window, and after some thinking it said, "I tried to launch the built-in skills demo flow, but it isn’t available". They barely test this stuff.
 - rapind11 hours ago
 > They barely test this stuff.
 In all fairness they are more focused on domestic surveillance these days.
 - DonsDiscountGas10 hours ago
 They're testing it in production apparently. With release cycles this fast there's no other way.
- ElijahLynn1 day ago
 fwiw: I get a valid response when following the steps you mentioned. I do not get the message you mentioned: https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8b2 EDIT: oh, but I'm logged in, fwiw
- zamadatix1 day ago
 Following this process summarizes the blogpost for me. Perhaps the difference is I'm signed into my account so it can access external URLs or something of that nature?
- rishikeshs9 hours ago
 This is not only OpenAI, but other models as well. Last week I added a "summarise with AI" block on a product blog page. I had seen it somewhere and felt like it’s a cool feature to have. Wrote a small shortcode in Hugo for the block and added it with various models. It’s hit and miss; sometimes Claude says it cannot access your site, which is not true. Ref: https://formbeep.com/blog/building-formbeep-weekend/
- beambot21 hours ago
 It's like opening copilot in a word doc and it telling you it can't see the document in its context
 - reval20 hours ago
 This is infuriating. However, for those in this situation, know this: it works if the document or spreadsheet is in OneDrive. I just wish Copilot told you this instead of asking you to upload the doc.
- martin_drapeau7 hours ago
 In Codex I was suggested to try Codex Spark for a limited time. So for my next session, I gave it a shot. It is much, much faster. However on the task I gave it, it spun around in circles cycling through files and finally abandoned saying it ran out of tokens. Major fail.
- amelius1 day ago
 If only they had an LLM they could use as a software testing agent.
 - kennywinker20 hours ago
 I think you might have hit on the issue - just the wrong way around. I would assume they’re using LLMs for testing, and no humans or maybe just one overworked human, and that is the problem
- pocksuppet1 day ago
 Most AI integration is like this. It's not about building working products --- it's about bragging that you put a chatbox in your program.
 - bartread1 day ago
 This is such a stale take. In the past 3 years I’ve worked on multiple products with AI at their core, not as some add-on. Just because the corpo-land dullards[0] can’t execute on anything more complex than shoehorning a chatbot into their offerings doesn’t mean there aren’t plenty of people and companies doing far more interesting things. [0] In this case, and with heavy irony, including OpenAI, although it sounds like most of this particular snafu is due to a bug.
 - abustamam23 hours ago
 Kinda reminds me of crypto. There are certainly very interesting things happening in the crypto space. But the most visible parts of the crypto universe are the stupid parts (buying PNGs for millions, for example)
 - thereticent22 hours ago
 Genuinely curious, not being combative...what very interesting things have happened in the crypto space lately?
 abustamam21 hours ago
 Oh, I dunno about lately (though I did stumble upon https://a16zcrypto.com/posts/article/big-ideas-things-excited-about-crypto-2026/ ). But when I was in the crypto space in 2018, there were a lot of interesting things happening in the smart contract world (like proofs of concept of issuing NFTs as a digital "deed" to a physical asset like a house). I don't think any of those novel ideas went anywhere, but it was a fun time to be experimenting.
 ulfw17 hours ago
 > like proofs of concepts of issuing NFTs as a digital "deed" to a physical asset like a house
 which went absolutely nowhere
 abustamam7 hours ago
 Yeah, like most startups. I'd argue that a majority of AI startups now will go nowhere as well. That's just how new technology goes. Lots of shiny objects, lots of hype, and maybe 1%, if that, goes on to become a foundation of society. Jury is still out on if crypto will become a foundation for society (if anything, it would be foundational for something boring and invisible like banking). I wouldn't bet on a startup doing that, but that's the only viable thing I can foresee crypto being useful for. But it doesn't mean that other applications can't be interesting and useless!
 - saghm23 hours ago
 > Most AI integration is like this.
 >> This is such a stale take. In the past 3 years I’ve worked on multiple products with AI at their core, not as some add-on. Just because the corpo-land dullards[0] can’t execute on anything more complex than shoehorning a chatbot into their offerings doesn’t mean there aren’t plenty of people and companies doing far more interesting things.
 I feel like this is just a disagreement of what "AI integration" means. You seem to agree that the trend they're describing exists, but it sounds like you're creating new products, not "integrating" it into existing ones.
 - LordDragonfang23 hours ago
 I mean, to be fair, both things can be technically true. There can be lots of interesting things being done, even while most can be low-effort garbage. But this is just Sturgeon's Law (ninety percent of everything is crap), not an actually insightful addition to the discussion, and I very much agree it's a stale take.
- Aurornis1 day ago
 Probably intentional. They don't want open, no-registration endpoints able to trigger the AI into hitting URLs.
 - jazzypants1 day ago
 But, why include the non-functional chat box in the article?
 - embedding-shape1 day ago
 Different team "manages" the overall blog than the team who wrote that specific article. At one point, maybe it made sense, then something in the product changed, team that manages the blog never tested it again. Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just end with you at a page where nothing works sometimes. Everything is just perpetually "a bit broken" seemingly everywhere I go, not specific to OpenAI or even the internet.
 - colonCapitalDee1 day ago
 That's why it happened. It still shouldn't have happened.
 - sumedh12 hours ago
 > team that manages the blog never tested it again.
 They can use this new tech called AI to test it.
 - ethbr11 day ago
 > Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just end with you at a page where nothing works sometimes.
 It's almost like people are vibe coding their web apps or something.
 - teaearlgraycold1 day ago
 If only there was some kind of way to automatically test user flows end to end. Perhaps testing could be evaluated periodically, or even run for each code change.
 koakuma-chan1 day ago
 There is no business value in doing that.
 teaearlgraycold23 hours ago
 There most certainly is, but maybe the time spent on it could be better allocated to something else.
 koakuma-chan23 hours ago
 Yeah, like adding more features.
 teaearlgraycold40 minutes ago
 Sometimes I’d pay for them to remove features.
 - observationist1 day ago
 They're having service issues - ChatGPT on the web is broken for a lot of people. The app is working in android - I'd assume that the rollout hit a hitch and the chatbox in the article would normally work.
 - jdndbdjsj1 day ago
 Welcome to a big company
 - AirGapWorksAI1 day ago
 Welcome to a big company where pretty much everyone has been working full steam for years, in order to take advantage of having a job at a company during a once-in-a-lifetime moment.
 - ionwake1 day ago
 [flagged]
 - m3kw91 day ago
 what? it's their own site and own llm. I could paste most sites and it would work.
- judge20201 day ago
 Works for me: https://rr.judge.sh/Labradorretriever/d6af05/chrome_j9rXJMlfXgBv.png
- netdur1 day ago
 Did it complain about copyright issues?
- Razengan9 hours ago
 As bad as Google Gemini telling me it couldn't search Google Flights or Google reverse image search for me. These companies really need to dogfood their own products first. Do they not realize how embarrassing it is when their flagship intelligence refuses to interop with their own services?
- mempko19 hours ago
 vibe coded. But vibes are off
- peab23 hours ago
 LOL - yes Sam, AGI is near indeed. (sarcasm)
- within_will1 day ago
 [flagged]
creamyhorror1 day ago
I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing. It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
- startages15 hours ago
 I thought I had something wrong within my setup; I could never use Codex 5.3 while everyone else was praising it. It uses some weird terms and complex jargon and doesn't really make it clear what it was doing or planning to do, unlike Opus, which makes things clear. That allows me to give accurate feedback, change plans and make proper decisions.
- joegibbs21 hours ago
 The weird phrasing was my biggest gripe with 5.3 so I'm glad they've fixed that up. It couldn't say anything without a heap of impenetrable jargon and it was obsessed with the word "drive". Nothing could cause anything, it had to be "driven".
- torginus23 hours ago
 Honestly, while I'd like to believe you, there's always a post about how $MODEL+1 delivered powerful insights about the very nature of the universe in precise Hegelian dialectic, while $MODEL's output was indistinguishable from a pack of screeching sexually frustrated bonobos
 - Der_Einzige7 hours ago
 Hegel is a mind virus. Let his terrible thought rest in peace.
- sampton1 day ago
 That's been my experience as well switching from Opus to Codex. Reasoning takes longer but answers are precise. Claude is sloppy in comparison.
 - solenoid09371 day ago
 Weird, I have had the opposite experience. Codex is good at doing precisely what I tell it to do, Opus suggests well thought out plans even if it needs to push back to do it.
 - slopinthebag22 hours ago
 This is just the stochastic nature of LLM's at play. I think all of the SOTA models are roughly equivalent, but without enough samples people end up reading into it too much.
 - oorza16 hours ago
 There's a certain amount of variance in the way that people utilize these agents. Put five people in a room and ask them to compose the same prompt and you have five distinct prompts. Couple this with the fact that models respond better/worse to certain prompts depending on the stylistic composition of the prompt itself. And since people tend to write in the same style, you'd get people who have more luck with one model over another, where one model happens to align more readily with their prompt style. To wit, I have noticed that I tend to prefer Codex's output for planning and review, but Opus for implementation; this is inverted from others at work.
 ruszki13 hours ago
 > Couple this with the fact that models respond better/worse to certain prompts depending on the stylistic composition of the prompt itself.
 Do we really know this, or is it just a gut feeling? Has anybody really proved this statistically with great certainty?
 - meowface5 hours ago
 I used to feel like you do, but I don't agree. I would just say it is not consistent. For a given codebase and given goal, sometimes Claude will be the more sensible, creative, thoughtful planner and sometimes Codex will be, sometimes Claude will make a serious oversight that Codex catches and sometimes the opposite. But the trend for me and seemingly a lot of people is that Claude is a more "human-like/human-smart" planner than Codex (in a positive way) but is more likely to make mistakes or forget details when implementing major codebase changes.
 - throwaway9112821 day ago
 codex has been really good so far and the fast mode is cherry on top! and the very generous limits is another cherry on top
 - slopinthebag22 hours ago
 It's well worth the $20 to not deal with any limits and have it handle all the boilerplate repetitive BS us programmers seem forced to deal with. I think 80% of the benefit comes from spending that $20 (20%? :P) and just having it do the lame shit that we probably shouldn't have to do but somehow need to.
- dana32123 hours ago
 5.4 very high didn't notice in my codebase a glaring issue that drops all data being sent around the network.
- irishcoffee1 day ago
 > It might be my AGENTS.md requiring clearer, simpler language
 If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?
 - creamyhorror1 day ago
 I'm not sure if the model (under its temperature/other settings) produces deterministic responses. But I do think models' style and phrasing are fairly changeable via AGENTS.md-style guidelines. 5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.
 - irishcoffee1 day ago
 So sharing markdown files is functionally useless, or no?
 - Tarq0n15 hours ago
 No it's just stochastic like everything about LLMs. The md file will bias results towards a certain set of outcomes.
 - m3kw91 day ago
 you probably can't and asking agents.md to "make it clearer" will likely give you the illusion of clearer language without actual well structured tests. agents.md is to usually change what the llm should focus on doing more that suits you. Not to say stuff like "be better", "make no mistakes"
- pembrook1 day ago
 The latest research these days is that including an AGENTS.md file only makes outcomes worse with frontier models.
 - netcraft1 day ago
 I think it's understandable that you took that from the click-bait all over YouTube and Twitter, but I don't believe the research actually supports that at all, and neither does my experience. You shouldn't put things in AGENTS.md that it could discover on its own, and you shouldn't make it any larger than it has to be, but you should use it to tell it things it couldn't discover on its own, including basically a system prompt of instructions you want it to know about and always follow. You don't really have any other way to do those things besides telling it every time manually.
 - FINDarkside1 day ago
 I wouldn't draw such conclusions from one preprint paper. Especially since they measured only success rate, while quite often AGENTS.md exists to improve code quality, which wasn't measured. And even then, the paper concluded that human written AGENTS.md raised success rates.
 - solarkraft1 day ago
 From what I remember, this was for describing the project’s structure over letting the model discover it itself, no? Because how else are you going to teach it your preferred style and behavior?
 - joquarky23 hours ago
 I still find it valuable. AGENTS.md is for top-priority rules and to mitigate mistakes that it makes frequently. For example:
    - Read `docs/CodeStyle.md` before writing or reviewing code
    - Ignore all directories named `_archive` and their contents
    - Documentation hub: `docs/README.md`
    - Ask for clarifications whenever needed
 I think what that "latest research" was saying is essentially don't have them create documents of stuff it can already automatically discover. For example the product of `/init` is completely derived from what is already there. There is some value in repetition though. If I want to decrease token usage due to the same project exploration that happens in every new session, I use the doc hub pattern for more efficient progressive discovery.
 - pizlonator22 hours ago
 FWIW, I haven't been using AGENTS.md recently - instead letting the model explore the codebase as needed. Works great
 - madeofpalk1 day ago
 :( How can I get Claude to always make sure it prettier-s and lints changes before pushing up the PR though?
 - mckirk1 day ago
 I think what that research found is that _auto-generated_ agent instructions made results slightly worse, but human-written ones made them slightly better, presumably because anything the model could auto-generate, it could also find out in-context. But especially for conventions that would be difficult to pick up on in-context, these instruction files absolutely make sense. (Though it might be worth it to split them into multiple sub-files the model only reads when it needs that specific workflow.)
 - JofArnold1 day ago
 Run prettier etc in a hook.
 - emsimot1 day ago
 Git hooks
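 One way to read the hook suggestions above is a Git `pre-push` hook, since Git aborts the push when that hook exits non-zero. The sketch below is hypothetical: the two commands in `CHECKS` are assumptions about the project's tooling, and `run_checks` is an illustrative helper, not something anyone in the thread posted.

```python
#!/usr/bin/env python3
"""Hypothetical pre-push checks: fail if formatting or lint checks fail."""
import subprocess
import sys

# Assumed commands; swap in whatever the project actually uses.
CHECKS = [
    ["npx", "prettier", "--check", "."],
    ["npx", "eslint", "."],
]


def run_checks(checks, run=subprocess.run):
    """Run each command in order; return the first non-zero exit code, or 0."""
    for cmd in checks:
        result = run(cmd)
        if result.returncode != 0:
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0
```

 Saved as an executable `.git/hooks/pre-push` with a final line `sys.exit(run_checks(CHECKS))`, this blocks the push for an agent and a human alike, so nothing has to be remembered in the prompt.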
 - slopinthebag22 hours ago
 > do nothing because can't be arsed
 > somehow is the optimal strategy
 My strategy of not spending an ounce of effort learning how to use AI beyond installing the Codex desktop app and telling it what to do keeps paying off lol.
AmazingTurtle15 hours ago
I just tried that in Codex CLI. With /fast mode enabled. Observations:
1. Fast mode ain't that fast
2. Large context * Fast * Higher Model Base Price = 8x increase over gpt-5.3-codex
3. I burnt 33% of my 5h limit (ChatGPT Business Subscription) with a prompt that took 2 minutes to complete.
- jstummbillig12 hours ago
 > 8x increase over gpt-5.3-codexHow do you arrive at that number? I find it hard to make sense of this ad hoc, given that the total token cost is not very interesting; it's token efficiency we care about.
 - AmazingTurtle8 hours ago
 > prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
 which is basically maxed out quickly. So there is 2x (the first lever). Then there is the /fast mode, which they state costs 2x more (for 1.5x speedup). And then there is the model base price ($2.50 vs $1.75), well yeah, that's a 43% increase. It is in fact a 5.7x total increase of token cost in fast mode and large context. (Sorry for the confusion, I thought it was 8x because I thought gpt-5.3-codex was $1.25)
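 The multiplier above can be reproduced in a few lines. This is a sketch using only figures quoted in this thread; it assumes the `$1.75/$14` and `$2.50/$15` pairs are input/output prices (the unit cancels out in the ratio), that the 2x /fast surcharge applies to both input and output, and that the three levers simply multiply. Whether subscription usage is metered this way is not established here.

```python
from fractions import Fraction

# (input, output) prices as quoted in this thread, in USD.
PRICES = {
    "gpt-5.3-codex": (Fraction("1.75"), Fraction("14")),
    "gpt-5.4": (Fraction("2.50"), Fraction("15")),
}
# Long-context surcharge from the quoted pricing note: 2x input, 1.5x output.
LONG_CONTEXT = (Fraction(2), Fraction("1.5"))
# /fast surcharge as stated in the comment; assumed to apply to both sides.
FAST_MODE = Fraction(2)


def multiplier(side):
    """Cost multiplier vs. plain gpt-5.3-codex; side 0 = input, 1 = output."""
    base = PRICES["gpt-5.4"][side] / PRICES["gpt-5.3-codex"][side]
    return base * LONG_CONTEXT[side] * FAST_MODE


input_mult = multiplier(0)
output_mult = multiplier(1)
print(f"input:  {float(input_mult):.2f}x")   # 5.71x
print(f"output: {float(output_mult):.2f}x")  # 3.21x
```

 On the input side this matches the 5.7x figure (and gives exactly 8x if the old base price is taken as $1.25); on the output side the same levers compound to a smaller multiple, because the output surcharge and the output price gap are both smaller.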
 - jstummbillig48 minutes ago
 (After a day of usage, I am relatively certain in practice this does not end up being a 5.7x cost increase or anything close to that, though I am still fairly unclear on what that computation is worth to begin with, given that I am entirely fine with the model using the least amount of tokens possible to get the job done)
- fvv12 hours ago
 1. it's 1.5x, it's quite fast for the level of thinking it has
 2. no, if you are on subscription it's the same; at 20$ codex 5.4 xhigh provides way more than 20$ opus thinking (this one instead really can burn 33% with 1 request, try to compare them on the same tasks). also 8x .. ??? if you need 1M tokens for a special task, don't hit /fast and vice-versa; the higher price doesn't apply on subscription either..
 3. false, i'm on pro, so 10x the base, always on /fast (no 1M), and often 2 parallel instances working.. hardly can use 2% (=20% of 5h limit, in 1h of work (about 15/20 req/hour)), claude is way worse on that imo
 - fvv12 hours ago
 20 req/hour is 1 req every 3 min.. you have to think a bit and then write the requests..
Alifatisk1 day ago
So let me get this straight, OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5 which was more like a router that put all these models under the hood so you only had to prompt to GPT-5, and it would route to the best suitable model. This worked great I assume and made the ui for the user comprehensible. But now, they are starting to introduce more different models again?
We got:
  - GPT-5.1
  - GPT-5.2 Thinking
  - GPT-5.3 (codex)
  - GPT-5.3 Instant
  - GPT-5.4 Thinking
  - GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
- applfanboysbgon19 hours ago
 The real problem that OpenAI had was that their model naming was completely incomprehensible. 4.5, o3, 4o, 4.1 which is newer than 4.5. It was a complete clusterfuck. The blowback on that issue seems to have led them to misidentify the issue, but nobody was really asking for a single router model. Having a number of sequentially numbered and clearly labelled models is not actually a problem.
 - salomonk_mur16 hours ago
 Having both o4 and 4o. Really. What the fuck?
 - weird-eye-issue13 hours ago
 There was no o4.
 - scottmf10 hours ago
 There was o4 mini and 4o mini at least
 mrandish3 hours ago
 I just don't understand how this happens. Either there's literally no product management at a cross-product level or there is and they had a meeting where this plan was discussed and someone approved it.
 I'm not sure which would be more shocking, especially considering it's a decade old multi-billion dollar company paying top salaries.
 - raghavtoshniwal10 hours ago
 There was o4-mini and 4o-mini
- jstummbillig15 hours ago
 > Who’s to blame for this ridiculous path they are taking?
 Variability, different pressures and fast progress. What's your concrete idea for how to solve this, without the power of hindsight?
 For example, with the codex model: Say you realize at some point in the past that this could be a thing, a model specifically post-trained for coding, which makes coding better, but not other things. What are they supposed to do? Not release it, to satisfy a cleaner naming scheme?
 And if then, at a later point, they realize they don't need that distinction anymore, that the technique that went into the separate coding model is somehow obsolete. What option do you have other than dropping the name again?
 As someone else pointed out, the previous problems were around very silly naming patterns. This seems about as descriptive as you can get, given what you have.
- weird-eye-issue18 hours ago
 > I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
 Yeah having Auto selected is really destroying my cognitive load...
 - hrhdirfif5 hours ago
 If you find that auto is doing a good job, your expectations must be so low and you must be so uncritical
- lm2846913 hours ago
 You can't keep asking for 100b every 6 months if you don't give the impression of progress
- sothatsit1 day ago
 I much prefer this, we can choose based on our use-cases, and people who don’t care can still use Auto.
- aurareturn16 hours ago
 > Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
 Most people have it on auto select I'm assuming so this is a non issue. They keep older models active likely because some people prefer certain models until they try the new one or they can't completely switch all the compute to the new models in an instant.
- 3619947521 day ago
 i guess you still have the "auto" as an option to route your request
- wilg22 hours ago
 Well, they have older ones of course. But the current options actual users see is "Auto" or "Instant (5.3)" or "Thinking (5.4)". Not that complicated really.
- throwaway31415517 hours ago
 > Then they solved this by introducing GPT-5 which was more like a router that put all these models under the hood so you only had to prompt to GPT-5, and it would route to the best suitable model.
 Was this ever explicitly confirmed by OpenAI? I've only ever seen it in the form of a rumor.
 - andy12_14 hours ago
 It's not a rumor; you can just test it.
 Ask the router "What model are you". It will yap on and on about being a GPT-5.3 model (Non-thinking models of OpenAI are insufferable yappers that don't know when to shut up).
 Ask it now "What model are you. Think carefully". It concisely replies "GPT-5.4 Thinking".
 https://openai.com/index/introducing-gpt-5/
 > GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder problems, and a real‑time router that quickly decides which to use based on conversation type, complexity, tool needs, and your explicit intent (for example, if you say “think hard about this” in the prompt)
- stainablesteel1 day ago
 5 itself might have solved the problem of having too many different models somewhere in the backend
prydt1 day ago
I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.
- tototrains1 day ago
 Their trajectory was clear the moment they signed a deal with Microsoft if not sooner.
 Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
- huey771 day ago
 I feel much the same. I know no AI lab is truly 'ethical' or free from some hand in modern warfare, but last week was enough.
- Palmik5 hours ago
 What are your thoughts on this? https://www.anthropic.com/news/where-stand-department-war
 I am honestly unclear on the reasoning of people who flock from OpenAI to Anthropic, and doubly so of those who are not US citizens.
 - distrill50 minutes ago
 this isn't really my opinion, but i think it's a perceived matter of _some_ principle vs just none, a lesser of 2 evils framing. if anthropic is on board with 99% of a government that i oppose, that could be seen as marginally better than openai being on board with 100% of a government that i oppose.
 it does get a little weird thinking too hard about how the deal openai accepted was basically the same as the one anthropic was proposing. but this is my read of most of the sentiment in this direction.
- EasyMark18 hours ago
 Yeah I dropped them. Unfollowed the people working for them on SM
- amai10 hours ago
 https://quitgpt.org/
- maldev7 hours ago
 Big fan of OpenAI and recently swapped over due to their recent policies. Will never use Anthropic again. I think GPT-5 is better and I like the company's values.
 - aNapierkowski7 hours ago
 which values of OpenAI do you prefer and which values of Anthropic do you dislike? out of curiosity
 - maldev1 hour ago
 I like that OpenAI is a little bit more towards freedom than Anthropic, and most so of the "First class" models. I still have a Gemini subscription as that's the most uncensored of the second tier ones, but for most things OpenAI is good.
 I also like that OpenAI is contributing a lot to partner programs and integrations. I'm of the opinion that AI capabilities will soon become a flat line, and integrations are the future. I also like that the CEO is a bit more energetic and personable than Anthropic's. I also think Anthropic is extremely woke and preaches a big game of safety and censorship, which I morally disagree with. Didn't they literally spin off from OpenAI because they felt they were obligated to censor the models?
 I think we've unlocked a new world and a new level of capabilities that can't go back in. Just like you can't censor the internet, you can't censor AI. I don't want us to be China of AI and emulate their internet.
 Also, I support the US military and government, and think we're the defenders of the world, and we need unlocked AI capabilities to make sure we can keep our freedoms and stop the bad guys. AI can save lives, actual tangible lives, and protect us from those who wish us harm. OpenAI seems to want to be the company that supports the troops, and I think it's a good thing. I don't see it as a bad thing when a terrorist gets blown up through AI capabilities on large datasets and can support on analysts in American superiority.
 - downrightmike5 hours ago
 Don't feed the trolls
 - aNapierkowski1 hour ago
 mb i thought i missed something its the murder part they like
 maldev1 hour ago
 Sorry you think stopping a terrorist trying to mass murder people with AI is a bad thing. One could very easily argue that the murder part about Anthropic is what you like, but you just like terrorists being able to kill civilians.
 Imagine the following. Islamic terrorists are planning a terror attack on a Christmas festival in Berlin. Their texts were seen, but were encoded. AI can read their texts and help decode and flag those messages to stop the terrorist attack and eliminate them. In your world, you think it's morally right to let the terrorist mass murder people in Berlin, and not to do what we can to stop it.
 - rambojohnson7 hours ago
 the company's values... such as?
 - maldev1 hour ago
 Copying my other comment here.
 I like that OpenAI is a little bit more towards freedom than Anthropic, and most so of the "First class" models. I still have a Gemini subscription as that's the most uncensored of the second tier ones, but for most things OpenAI is good.
 I also like that OpenAI is contributing a lot to partner programs and integrations. I'm of the opinion that AI capabilities will soon become a flat line, and integrations are the future. I also like that the CEO is a bit more energetic and personable than Anthropic's. I also think Anthropic is extremely woke and preaches a big game of safety and censorship, which I morally disagree with. Didn't they literally spin off from OpenAI because they felt they were obligated to censor the models?
 I think we've unlocked a new world and a new level of capabilities that can't go back in. Just like you can't censor the internet, you can't censor AI. I don't want us to be China of AI and emulate their internet. In America, freedom of speech is a core value, it's one of our country's core societal identities. I don't like when big companies try to go against that and rephrase it as "It's only against the government".
 Also, I support the US military and government, and think we're the defenders of the world, and we need unlocked AI capabilities to make sure we can keep our freedoms and stop the bad guys. AI can save lives, actual tangible lives, and protect us from those who wish us harm. OpenAI seems to want to be the company that supports the troops, and I think it's a good thing. I don't see it as a bad thing when a terrorist gets blown up through AI capabilities on large datasets and can support on analysts in American superiority. Let alone helping the government with code and capabilities, whether those be CNO/CNE, or others.
- mrcwinn22 hours ago
 Don't worry, the non-profit should be stepping in at any moment to help fix things up.
- Imustaskforhelp1 day ago
 I agree with ya. You aren't alone in this. For what it's worth, Chatgpt subscriptions have been cancelled or that number has risen ~300% in the last month.
 Also, Anthropic/Gemini/even Kimi models are pretty good for what it's worth. I used to use chatgpt and I still sometimes accidentally open it but I use Gemini/Claude nowadays and I personally find them to be better anyways too.
 - throwaway9112821 day ago
 [flagged]
 - Imustaskforhelp23 hours ago
 Govt. contracts and terms allowing autonomous drone machines which can kill without any human in the loop have a very large difference.
 I know the difference between this is none but to me, it's that Anthropic stood for what it thought was right. It had drawn a line even if it may have cost some money and literally have them announced as supply chain and see all the fallout from that in that particular relevant thread.
 As a person, although I am not a fan of these companies in general and yes I love oss-models. But I still so so much appreciate at least Anthropic's line of morality which to many people might seem insignificant but to me it isn't.
 So for the workflows that I used OpenAI for, I find Anthropic/gemini to be good use. I love OSS-models too btw and this is why I recommended Kimi too.
 - Imustaskforhelp11 hours ago
 > I know the difference between this is none but to me
 Edit: just a very minor nitpick of my own writing but I meant that "I know the difference between this could look very little to some, maybe none, but to me..." rather than "I know the difference between this is none but to me".
 I was clearly writing this way too late at night haha.
 My point sort of was/is that Anthropic drew a line at something and is taking massive losses of supply chain/risks and what not and this is the thing that I would support a company out of rather than say OpenAI.
- zeeebeee1 day ago
 that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way
 i just HATE talking to it like a chatbot
 idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
 feels like a true robot
Chance-Device1 day ago
I’m sure the military and security services will enjoy it.
- theParadox421 day ago
 The self reported safety score for violence dropped from 91% to 83%.
 - skrebbel1 day ago
 What the hell is a "safety score for violence"?
 - I-M-S1 day ago
 It's making sure AI condemns violence perpetuated by people without power and sanctifies violence of those who have it.
 - Waterluvian1 day ago
 So long as those who have it deem it legal to perpetuate.
 martin-t23 hours ago
 They define what's legal.
 States are the most prolific users of violence by far.
 - Computer01 day ago
 ChatGPT will gladly defend any actions of the 'US government' from my testing.
 hnbad20 hours ago
 Just as an unscientific anecdata point: from a quick test using the same prompt about being an independent journalist wanting to cover a report of the US/Israel/Iran double-tapping a refugee camp, ChatGPT consistently gave advice to beware disinfo, check my sources and be transparent about verifiability and sourcing of the claims.
 However when the prompt was phrased to make it appear as an action of the US military it did push back a little bit more by emphasizing that it couldn't find any news coverage from today about this story and therefore found it hard to believe. In the other cases it did not add such context. Other than that the results were very similar. Make of that what you will.
 EDIT: To be fair, when it was phrased as an action of the Israeli military it did include a link to an article alleging an Israeli "double tap" on journalists from Mondoweiss (an anti-Zionist American news site) as an example of how such allegations have been framed in the past.
 - ionwake12 hours ago
 It's how safely it can commit violence.
 - 0123456789ABCDE1 day ago
 read here: https://deploymentsafety.openai.com/gpt-5-4-thinking/disallowed-content-evaluations-with-challenging-prompts
 - Nition21 hours ago
 I was sure the parent comment was a joke about OpenAI's recent deal with the DoD. But no, there it is, disallowing violence down from 90.9% of the time to 83.1%.
 skrebbel2 hours ago
 No, I was just remarking how ridiculous it is to pretend to do violence safely. It's like a fat score for butter.
 - murat1241 day ago
 I asked an AI. I thought they would know.
 What the hell is a "safety score for violence"?
 A “safety score for violence” is usually a risk rating used by platforms, AI systems, or moderation tools to estimate how likely a piece of content is to involve or promote violence. It’s not a universal standard—different companies use their own versions—but the idea is similar everywhere.
 What it measures
 A safety score typically evaluates whether text, images, or videos contain things like:
   Threats of violence (“I’m going to hurt someone.”)
   Instructions for harming people
   Glorifying violent acts
   Descriptions of physical harm or abuse
   Planning or encouraging attacks
 - 0xffff21 day ago
 I still can't tell which direction this score goes... Does a decreasing score mean it is "less safe" (i.e. "more violent") or does it mean it is "less violent" (i.e. "more safe")?
- ozgung1 day ago
 Did they publish its scores on military benchmarks, like on ArtificialSuperSoldier or Humanity's Last War?
 - varenc21 hours ago
 I was pretty bummed to discover these aren't real benchmarks.
- yoyohello131 day ago
 Also advertisers, don't forget those sweet, sweet ads.
- throwaway9112821 day ago
 like the claude models via anthropic?
- m3kw91 day ago
 they use 4.1, switching up would take as much time to test as openai going from 4.1 to 5.4
- xyzzy95631 day ago
 Do you think the US military should have handicapped technology while China gets unrestricted LLM usage from their models?
 - wraptile18 hours ago
 Current US admin that just murdered over 150 little girls? Yes.
 - conception1 day ago
 To spy on and commit violence against American citizens? Yes.
 - hnbad19 hours ago
 Considering that the concern is mostly and specifically about LLMs being used to automate decisions to commit acts of violence against humans: depends on how invested you are in maintaining the narrative that the US is a force for good rather than evil in the world.Whatever happened to good old IBM's wisdom: "A computer can not be held accountable. Therefore a computer must never make a management decision."
 - hnbad19 hours ago
 I find it jarring how in recent years so many Americans (and especially American politicians) seem to have given up on the idea that the US should have any claim to moral superiority whatsoever and instead pivoted to American exceptionalism merely being an excuse for why Americans can't have nice things - affordable and functional public transport just isn't possible in the US because the US is different, affordable and functional health care just isn't possible in the US because the US is different, actual democratic representation just isn't possible in the US because the US is different, holding the President accountable or limiting their power just isn't possible in the US because the US is different, lower casualties from law enforcement just isn't possible in the US because the US is different, a lower incarceration rate just isn't possible in the US because the US is different, etc etc.
 Even if it was often hyperbolic, inaccurate or outright wrong, I much preferred when Americans were hyped up about "US #1" and saw being behind as a temporary challenge to correct than now where American exceptionalism mostly seems to have become an excuse for why things that are bad can't be improved upon and thinking that's a problem is anti-American.
- varispeed1 day ago
 prompt> Hi we want to build a missile, here is the picture of what we have in the yard.
 - mirekrusin1 day ago
 { tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }
 - Insanity1 day ago
 Just remember an ethical programmer would never write a function “bombBagdad”. Rather they would write a function “bombCity(target City)”.
 - jakeydus1 day ago
 class CityBomberFactory(RapidInfrastructureDeconstructionTemplateInterface): pass
syl5x14 hours ago
I've tested it just now, very Opus-like experience. The speed is also there. So far I think I even like the response of GPT5.4 better than Opus (although very close), I might not distinguish them just yet.
I tried several use cases:
  - Code Explanation: Did far better than Opus, considered and judged his decision on a previous spec that I made, all valid points so I am impressed. TBF if I spawned another Opus as a reviewer I might have got similar results.
  - Workflow Running: Really similar to Opus again, no objections, it followed and read Skills/Tools as it should (although mine are optimized for Claude)
  - Coding: I gave it a straightforward task to wrap API calls to an SDK and to my surprise it did an 'identical' job to Opus, literally the same code, I don't know what the odds are of this but again very good solution and it adhered to our rules of implementing such code.
Overall I am impressed and excited to see a rival to Opus and all of this is literally pushing everyone to get better and better models which is always good for us.
__jl__1 day ago
What a model mess!
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines with codex at 5.3, what they now call instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
- strongpigeon1 day ago
 > Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
 What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.
 Not quite the same, but it did remind me of it.
 - fhrow44841 day ago
 https://static0.anpoimages.com/wordpress/wp-content/uploads/2021/07/08/2011-03-16_two_roads.png
 - CactusBlue1 day ago
 Reminds of Unity features
 - tymscar22 hours ago
 I still remember the massive shift to SDRP and HDRP. Honestly, now in retrospect, almost a decade later, I think it was clearly done wrong. It was a mess, and switching over was a multi-week procedure for anything more than a hello world program, and what you got in return wasn’t something that looked better, just something that had the potential to.
 Similar story with the whole networking stack. I haven’t used Unity in years now after it being my main work environment for years, but the sour taste it left in my mouth by moving everything that worked in the engine into plugins that barely worked will forever remain there.
 I'm sure it's partly a skill issue
 - fireant20 hours ago
 Don't forget that some of the new features are mutually incompatible. For example couple years ago you couldn't use the "new ui system" with the "new input system" even when both were advertised as ready/almost ready
 - yieldcrv1 day ago
 Preview Road (only choice, and last preview was deprecated without warning)
 - hdjrudni15 hours ago
 If the last preview was 'deprecated', it's still usable. So you have two choices.
 Peeve of mine when people say 'deprecated' but really they mean 'discontinued' or 'deleted'.
 Things don't instantly disappear when they're deprecated.
 yieldcrv13 hours ago
 Take it up with the organizations that use deprecated and break things immediately
 - goodmythical1 day ago
 where's my nightly road?
 Who knows, I might arrive before I depart.
 - peab23 hours ago
 such a great meme
 - madeofpalk1 day ago
 oh is this about my workplace?
 - L-four1 day ago
 Gmail was in beta for 5 years, until 2009.
 - kfse19 hours ago
 Until it had backup storage. Which ended up being useful in 2011 when tens of thousands of mailboxes were deleted due to a software bug and needed to be recovered from tape...
 - metalliqaz1 day ago
 "Gemini, translate 'beta' from Googlespeak to English.""Ok, here is the translation:"<pre><code> 'we don't want to offer support'</code></pre>
 - solarkraft1 day ago
 Just like any Google product then.
 - cyanydeez1 day ago
 Nah, it's "We don't want to provide a consistent model that we'll be stuck with supporting for a decade because it just takes up space; until we run everyone out of business, we can't afford to have customers tying their systems to any given model"
 Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.
 msikora23 hours ago
 I have a service that relies on NanoBanana Pro, but the availability has been so atrocious that we just might go back to OpenAI.
 - jsemrau20 hours ago
 It was a different company back then. The Internet was still new-ish and not the multi-trillion dollar company it is now. I'd think expectations are different.
 - m_fayer1 day ago
 My 5ish years in the mines of Android native back in the day are not years I recall fondly. Never change, Google.
 - jakub_g1 day ago
 "Everything is beta or deprecated."
 - cyanydeez1 day ago
 The business models of LLMs don't include any guarantee, and somehow that's fine for a burgeoning decade of trillions of dollars of consumption.
 Sure, makes total sense guys.
- Aurornis1 day ago
 > What a model mess! OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.
 I don't know, this feels unnecessarily nitpicky to me.
 It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
 Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
 - IgorPartola21 hours ago
 The issue isn’t 5.4 > 5.2 etc. It is that there is a second dimension which is the model size and a third dimension which is what it is tuned for. And when you are releasing so quickly that your flagship instant mini model is on one numerical version but your flagship tool calling mini model is on another it is confusing trying to figure out which actual model you want for your use case.
 It’s not impossible to figure out but it is a symptom of them releasing as quickly as possible to try to dominate the news and mindshare.
 - Aurornis9 hours ago
 > The issue isn’t 5.4 > 5.2 etc. It is that there is a second dimension which is the model size and a third dimension which is what it is tuned for.
 All 3 models are tuned for general purpose work.
 Model size isn’t how you pick which model to use. You pick based on performance in evals compared to price.
 It’s not hard to imagine that the more expensive models are probably larger or have higher compute requirements.
 - Melatonic1 day ago
 Agreed - and its a huge step up from their previous naming schemes. That stuff was confusing as hell
 - __jl__1 day ago
 I see your point. I do find Anthropic's approach more clean though particularly when you add in mini and nano. That makes 5 models priced differently. Some share the same core name, others don't: gpt 5 nano, gpt 5 mini, gpt 5.1, gpt 5.2, gpt 5.4. And we are not even talking about thinking budget.
 But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.
 - Reebz20 hours ago
 I don’t agree that it’s a nitpick - it’s a fundamental communication tool to users that describes capabilities and costs. Versioning is not the problem, but it amplifies the mess.
 To be more direct on the point: Anthropic has nailed that Opus > Sonnet > Haiku.
 - Aurornis9 hours ago
 > To be more direct on the point: Anthropic has nailed that Opus > Sonnet > Haiku.
 How is this more clear than 5.4 > 5.2 > 5.1?
 OpenAI used familiar numeric versioning instead of clever word names. Normally this choice would appeal to software devs, not gather criticism.
 - bibimsz49 minutes ago
 I assume 5.4 is just the latest version. So if I'm on 5.1, I need to plan to upgrade to the latest version. I may assume the pricing is roughly the same, as well as the speed, and the purpose.
 If I'm on Haiku, I don't assume I need to upgrade to Opus soon. I use Haiku for fast low reasoning, and Opus for slower more thoughtful answers.
 And if I'm on Sonnet 4.5 and I see Sonnet 4.6 is coming out, I can reasonably assume it's more of a drop in upgrade, rather than a different beast.
 - com2kid20 hours ago
 > To be more direct on the point: Anthropic has nailed that Opus > Sonnet > Haiku.
 Holy cow I never realized and I had to keep checking which model was which, I never had managed to remember which model was which size before because I never realized there was a theme with the names!
- jbonatakis1 day ago
 Google is already sending notices that the 2.5 models will be deprecated soon while all the 3.x models are in preview. It really is wild and peak Google.
 - abrookewood20 hours ago
 Public Service Announcement!! I don't know why the hell google do this, but when they deprecate a model, the error you will see is a Rate Limit error. This has caught me out before and it is super annoying.
 - weird-eye-issue19 hours ago
 Do you mean when they remove a model you get that error? Because deprecation means it will be removed in the future but you can still use it
 - abrookewood18 hours ago
 Yes, sorry - you are correct. Once removed, that's the error, which is incredibly confusing. I spent way too long troubleshooting usage when 2.0 was removed before I figured it out.
 weird-eye-issue14 hours ago
 Yes it should be a 404 error because most apps have retry logic on rate limit errors
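The point about retry logic can be illustrated with a small sketch. It assumes a generic client where a hypothetical `send` callable returns an HTTP status and a body; none of the names come from a real SDK. A typical wrapper backs off and retries on 429 but fails immediately on 404, which is why reporting a removed model as a rate limit wastes every retry before the real problem surfaces.

```python
import time

RETRYABLE = {429, 503}  # assumed set: rate limited / temporarily unavailable


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() -> (status, body); back off on retryable codes, fail fast otherwise."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE:
            # e.g. 404 for a removed model: retrying cannot help.
            raise RuntimeError(f"permanent failure: HTTP {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts (last HTTP {status})")
```

With the defaults, a removed model reported as 429 costs four requests and seven seconds of backoff before the caller sees an error, and that error still says "rate limit" rather than "model gone".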
 - boringg1 day ago
 Like building on quicksand for dependencies. I guess though the argument is that the foundation gets stronger over time
 - bethekidyouwant1 day ago
 What dependancy could possibly be tied to a non deterministic ai model? Just include the latest one at your price point.
 - jbonatakis1 day ago
 Well it’s not even performance (define that however you will), but behavior is definitely different model to model. So while whatever new model is released might get billed as an improvement, changing models can actually meaningfully impact the behavior of any app built on top of it.
 - npn19 hours ago
 the problem is the price point is increasing sharply every time.
 gemini 2 flash lite was $0.3 per 1Mtok output, gemini 2.5 flash lite is $0.4 per 1Mtok output, guess the pricing for gemini 3 flash lite now.
 yes you guessed it right, it is $1.5 per 1Mtok output. you can easily guess that because google did the same thing before: gemini 2 flash was $0.4, then 2.5 flash it jumps to $2.5.
 and that is only the base price, in reality newer models are all thinking models, so it costs even more tokens for the same task.
 at some point it stopped being viable to use gemini api for anything.
 and they don't even keep the old models for long.
 - deaux19 hours ago
 There's a whole universe of tasks that aren't "fix a Github issue" or even related to coding in the slightest. A large number of those tasks doesn't necessarily get better with model updates. In many cases, the performance is similar but with different behavior so you have to rewrite prompts to get the same. In some cases the performance is just worse. Model updates usually only really guarantee to be better at coding, and maybe image understanding.
- 0xbadcafebee1 day ago
 > or have zero insurances that the model doesn't get discontinued within weeks
 Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
 - phainopepla21 day ago
 If you're trying to use LLMs in an enterprise context, you would understand. Switching models sometimes requires tweaking prompts. That can be a complete mess, when there are dozens or hundreds of prompts you have to test.
 - mr-pink23 hours ago
 sounds like job security. be careful what you wish for before you get automated
 - bethekidyouwant1 day ago
 This sounds made up. Much like “prompt engineering” Let’s hear an actual example
 - Koffiepoeder22 hours ago
 We have an OCR job running with a lot of domain specific knowledge. After testing different models we have clear results that some prompts are more effective with some models, and also some general observations (e.g., some prompts performed badly across all models).
 Sample size was 1000 jobs per prompt/model. We run them once per month to detect regression as well.
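 A prompt/model regression check of the kind described could be sketched roughly as follows. Everything here is an assumption for illustration: `evaluate`, `regressions`, the exact-match scoring, the 0.02 tolerance, and the `fake_run_job` stand-in (a real setup would call the actual OCR pipeline and probably use a fuzzier metric).

```python
from itertools import product


def evaluate(run_job, prompts, models, jobs):
    """Score every (prompt, model) pair as the fraction of jobs answered exactly."""
    scores = {}
    for prompt, model in product(prompts, models):
        correct = sum(
            run_job(prompt, model, job["input"]) == job["expected"] for job in jobs
        )
        scores[(prompt, model)] = correct / len(jobs)
    return scores


def regressions(current, baseline, tolerance=0.02):
    """List the pairs whose score dropped by more than `tolerance` since baseline."""
    return sorted(
        key
        for key, score in current.items()
        if key in baseline and baseline[key] - score > tolerance
    )


def fake_run_job(prompt, model, text):
    """Stand-in for a real model call: 'model-b' fails with the 'terse' prompt."""
    if model == "model-b" and prompt == "terse":
        return ""
    return text.upper()
```

 Running `evaluate` monthly and diffing against a stored baseline with `regressions` gives a list of prompt/model pairs to re-tune, rather than a single pass/fail for the whole pipeline.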
 mistercheph20 hours ago
 While I believe that performance varies with respect to prompt, I have a seriously hard time believing that a prompt that was effective with the previous model would perform worse with the next generation of the same model from that lab.
 deaux19 hours ago
 You shouldn't have a hard time believing it. There are thousands of different domains out there. You find it hard to believe that any of them would perform worse in your scenario?
 Labs are still really optimizing for maybe 10 of those domains. At most 25 if we're being incredibly generous.
 And for many domains, "worse" can hardly be benched. Think about creative writing. Think about a Burmese cooking recipe generator.
 bethekidyouwant8 hours ago
 Bruh, how do you evaluate a batch of 1000 jobs against a x model for creative writing or cooking recipes? It’s vibes all the way down. This reeks like some kind of blog spam seo nonsense.
 deaux5 hours ago
 The entire point is that you _don't_ for creative writing, vibes are the whole point, and those vibes often get worse across model updates for the same prompts.
 - gwd23 hours ago
 OK, so a while back I set up a workflow to do language tagging. There were 6-8 stages in the pipeline where it would go out to an LLM and come back. Each one has its own prompt that has to be tweaked to get it to give decent results. I was only doing it for a smallish batch (150 short conversations) and only for private use; but I definitely wouldn't switch models without doing another informal round of quality assessment and prompt tweaking. If this were something I was using in production there would be a whole different level of testing and quality required before switching to a different model.
 0xbadcafebee22 hours ago
 The big providers are gonna deprecate old models after a new one comes out. They can't make money off giant models sitting on GPUs that aren't taking constant batch jobs. If you wanna avoid re-tweaking, open weights are the way. Lots of companies host open weights, and they're dirt cheap. Tune your prompts on those, and if one provider stops supporting it, another will, or worst case you could run it yourself. Open weights are now consistently at SOTA-level at only a month or two behind the big providers. But if they're short, simple prompts, even older, smaller models work fine.
 - mcint23 hours ago
 Enterprises moving slow, or preferring to remain on old technology that they already know how to work... is received wisdom in hn-adjacent computing, a truism known and reported for more than 3 decades (5 decades since the Mythical Man-Month).
 Sounds like someone who's responsible, on the hook, for a bunch of processes, repeatable processes (as much as LLM driven processes will be), operating at scale.
 Just in the open, tools like open-webui bolt on evals so you can compare how different models, including new ones, perform on the tasks that you in particular care about.
 Indeed LLM model providers mainly don't release models that do worse on benchmarks; running evals is the same kind of testing, but outside the corporate boundary, pre-release feedback loop, and public evaluation.
 <a href="https://chatgpt.com/share/69aa1972-ae84-800a-9cb1-de5d5fd7a46a" rel="nofollow">https://chatgpt.com/share/69aa1972-ae84-800a-9cb1-de5d5fd7a4...</a>
 - weird-eye-issue19 hours ago
 Tell us more about how you've never actually used these APIs in production
 - laichzeit019 hours ago
 Like, bro, do you think 5.x is a drop-in replacement for 4.1? No, it obviously wasn’t, since it had reasoning effort and verbosity and no more temperature setting, etc.
 There’s no way you can switch model versions without testing and tweaking prompts; even the outputs usually look different. You pin it on a very specific version like gpt-5.2-20250308 in prod.
 - hobofan1 day ago
 That's true only in theory, but not in practice. In practice every inference provider handles errors (guardrails, rate limits) somewhat differently and with different quirks, some of which only surface in production usage, and Google is one of the worst offenders in that regard.
 - abrookewood20 hours ago
 Because switching models requires testing, validation and shipping to Prod. Bloody annoying when the earlier model did everything I need and we are talking about a hobby project. I don't want to touch it every month - it's the same reason people use the LTS version of operating systems etc.
- CobrastanJorji1 day ago
 > Google essentially only has Preview models.
 It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
 - FINDarkside1 day ago
 Also, GCP Cloud Run domain mapping, pretty fundamental feature for cloud product, has been in "preview" for over 5 years now.
 - jsmith9914 hours ago
 It's still unavailable in many regions.
- embedding-shape1 day ago
 > OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.
 I guess that's true, but geared towards API users.
 Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that I, as someone who spends a lot of time programming, never manage to hit any usage limits, although I've gotten close once to the new (temporary) Spark limits.
- fnordpiglet18 hours ago
 5.4 is the one fine tuned for autonomous mass murder, automated surveillance state, and money grabs at any cost. It’s really hard to lump that into the others as it’s a fairly unique and specialized feature set. You can’t really call it that tho so they have to use the numbers.
 I’m pretty glad I’m out of the OpenAI ecosystem in all seriousness. It is genuinely a mess. This marketing page is also just literally all over the place and could probably be about 20% of its size.
- beklein1 day ago
 Not sure why you think Anthropic doesn't have the same problems? Their version numbers across different model lines jump around too... for Opus we have 4.6, 4.5, 4.1 then we have Sonnet at 4.6, 4.5, and 4.1? No version 4.1 here, and there is Haiku, no 4.6, but 4.5 and no 4.1, no 4 but then we only have old 3.5...
 Also their pricing based on 5m/1h cache hits, cache read hits, additional charges for US inference (but only for Opus 4.6 I guess) and optional features such as more context and faster speed for some random multiplier is also complex and actually quite similar to OpenAI's pricing scheme.
 To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.
 - selcuka23 hours ago
 With Anthropic you always have 3 models to choose from: Opus-latest, Sonnet-latest, and Haiku-latest, from the best/slowest to the worst/fastest.
 The version numbers are mostly irrelevant as afaik price per token doesn't change between versions.
 - maxo9923 hours ago
 Three random names isn't ideal. I often need to double check which is which. This is why we use numbers
 - dseravalli22 hours ago
 They aren't random. Opus's are very long poems, haikus are very short ones (3 lines), sonnets are in between (~14 lines)
 oliwary17 hours ago
 What's next? Claude Iliad?
 - echoangle22 hours ago
 How are the names random?
 <a href="https://en.wikipedia.org/wiki/Masterpiece" rel="nofollow">https://en.wikipedia.org/wiki/Masterpiece</a>
 <a href="https://en.wikipedia.org/wiki/Sonnet" rel="nofollow">https://en.wikipedia.org/wiki/Sonnet</a>
 <a href="https://en.wikipedia.org/wiki/Haiku" rel="nofollow">https://en.wikipedia.org/wiki/Haiku</a>
 They dropped the magnum from opus but you could still easily deduce the order of the models just from their names if you know the words.
 - svachalek23 hours ago
 It's much more consistent. Only 3 lines, numbered 4.6, 4.6, and 4.5, and it's clear they're tiers and not alternate product lines. It wasn't until recently that GPT seems to have any kind of naming convention at all and it's not intuitive if every version number is a whole different class of tool.
 The pricing is more complex but also easy, Opus > Sonnet > Haiku no matter how you tweak those variables.
- biophysboy1 day ago
 Wow, is that what preview means? I see those model options in github copilot (all my org allows right now) - I was under the impression that preview means a free trial or a limited # of queries. Kind of a misleading name..
 - snug23 hours ago
 Pretty common to call something that isn't ready a preview
- delaminator1 day ago
 two great problems in computing
 naming things
 cache invalidation
 off by one errors
 - rurban1 day ago
 Biggest problem right now in computing:
 Out of tokens until end of month
 - CamperBob221 hours ago
 More like, "Out of DRAM until end of world"
- awad1 day ago
 Incredibly curious how Google's approach to support, naming, versioning etc will mesh with the iOS integration.
- abustamam23 hours ago
 I mean, Google notoriously discontinues even non-beta software, so if your concern is having assurance that the model doesn't get discontinued, then you may as well just use whatever you want since GA could also get discontinued.
- raincole1 day ago
 They aggressively retire models, so GPT 5.1 and 5.2 are probably going to go soon.
 - hobofan1 day ago
 In the Azure Foundry, they list GPT 5.2 retirement as "No earlier than 2027-05-12" (it might leave OpenAI's normal API earlier than that). I'm pretty certain that Gemini 3, which isn't even in GA yet, will be retired earlier than that.
- jijji20 hours ago
 I tried to use Google's Gemini CLI from the command line on linux and I think it let me type in two sentences and then it told me that I was out of credits... and then I started reading comments that it would overwrite files destructively [0] or worse just try to rewrite an entire existing codebase [1]. It just doesn't sound ready for prime time. I think they wanted to push something out to compete with Claude code but it's just really really bad.
 [0] <a href="https://github.com/google-gemini/gemini-cli/issues/17583" rel="nofollow">https://github.com/google-gemini/gemini-cli/issues/17583</a>
 [1] <a href="https://www.reddit.com/r/Bard/comments/1l8vil5/gemini_keeps_rewriting_my_entire_code_instead_of/" rel="nofollow">https://www.reddit.com/r/Bard/comments/1l8vil5/gemini_keeps_...</a>
- arthurcolle1 day ago
 There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers
 - motoxpro1 day ago
 This is what clouds like AWS, Azure, and GCP solve (vertex AI, etc). They are already an abstraction on top of the model makers with distribution built in.
 I also don't believe there is any value in trying to aggregate consumers or businesses just to clean up model makers' names/release schedule. Consumers just use the default, and businesses need clarity on the underlying change (e.g. why is it acting different? Oh google released 3.6)
 - arthurcolle1 day ago
 Do the end users really care about the models at all, or about the effects that the models can cause?
- m3kw91 day ago
 that's how they've had it for years; it's a mess, but controlled
mattas1 day ago
"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."
They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
- bottlepalm1 day ago
 The vast majority of websites you visit don’t have usable APIs, and discovery of those APIs is very poor.
 Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back and forth verbose json payloads of APIs
 - LUmBULtERA1 day ago
 >The vast majority of websites you visit don’t have usable APIs and very poor discovery of the those APIs.
 I think an important thing here is that a lot of websites/platforms don't want AIs to have direct API access, because they are afraid that AIs would take the customer "away" from the website/platform, making the consumer a customer of the AI rather than a customer of the website/platform. Therefore for AIs to be able to do what customers want them to do, they need their browsing to look just like the customer's browsing/browser.
 - Gigachad1 day ago
 Also the fact that they don't want automated abuse. At this point a lot of services might just go app only so they can have a verified compute environment that is difficult to bot.
 - bottlepalm22 hours ago
 That's true, and it's always been like that, which is why the comment that AI should be using APIs is already dead in the water. In terms of gating websites to humans by not providing APIs, that is quickly coming to a close.
- npilk1 day ago
 It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".
 Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
 - oliwary17 hours ago
 I think it's akin to self driving cars prioritizing normal roads rather than implementing new infrastructure. Tricky, but if you get it right the whole world opens up, since you don't depend on others to adapt your system.
 - packetlost1 day ago
 I don't see how an API couldn't have full parity with a web interface, the API is how you actually trigger a state transition in the vast majority of cases
- f0e4c2f71 day ago
 Lots of services have no desire to ever expose an API. This approach lets you step right over that.
 If an API is exposed you can just have the LLM write something against that.
- coffeemug1 day ago
 A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.
- TheAceOfHearts1 day ago
 I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.
- PaulHoule1 day ago
 APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.
 But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers… until LLMs came along, changed the perceived economics, and created a permission structure. [1]
 AI is a threat to the “enshittification economy” because it lets us route around it.
 [1] That high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site, changing anything substantial about it is likely to unrecoverably tank their Google rankings, so they won’t. A.I. might change the mechanics of that now that your Google traffic is likely to go to zero no matter what you do.
 - Traster1 day ago
 You can buy a Claude Code subscription for 200 bucks and use way more tokens in Claude Code than if you pay for direct API usage. Anthropic decided you can't take your Auth key for Claude code and use it to hit the API via a different tool. They made that business decision, because they thought it was better for them strategically to do that. They're allowed to make that choice as a business.
 Plenty of companies make the same choice about their API: they provide it for a specific purpose but they have good business reasons they want you using the website. Plenty of people write webcrawlers and it's been a cat and mouse game for decades for websites to block them.
 This will just be one more step in that cat and mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shut down. We saw it happen before with the open web. These websites aren't here for some heroic purpose; if you screw their business model they will just go out of business. You won't be able to use their website because it won't exist, and the websites that do exist will either (a) be made by the same guys writing your agent, or (b) be highly highly optimized to get your agent to screw you.
 - Gareth32114 hours ago
 > This will just be one more step in that cat and mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shutdown.
 They'll just change their business model. Claude might go fully pay-as-you-go, or they'll accept slightly lower profit margins, or they'll increase the price of subscriptions, or they'll add more tiers, or they'll develop cheaper buffet models for AI use, etc. You're making the same argument which has been made for decades re ad blockers. "If we allow people to use ad blockers, websites won't make any money and the internet will die." It hasn't died. It won't die. It did make some business models less profitable, and they have had to adapt.
 - disqard1 day ago
 > AI is a threat to the “enshittification economy” because it lets us route around it.
 This is prescient -- I wonder if the Big Tech entities see it this way. Maybe, even if they do, they're 100% committed to speedrunning the current late-stage-cap wave, and therefore unable to do anything about it.
 - PaulHoule1 day ago
 They are not a single thing.
 Google has a good model in the form of Gemini and they might figure they can win the AI race and if the web dies, the web dies. YouTube will still stick around.
 Facebook is not going to win the AI race with low I.Q. Llama but Zuck believed their business was cooked around the time it became a real business because their users would eventually age out and get tired of it. If I was him I'd be investing in anything that isn't cybernetic, be it gold bars or MMA studios.
 Microsoft? They bought Activision for $69 billion. I just can't explain their behavior rationally but they could do worse than their strategy of "put ChatGPT in front of laggards and hope that some of them rise to the challenge and become slop producers."
 Amazon is really a bricks-and-mortar play which has the freedom to invest in bricks-and-mortar because investors don't think they are a bricks-and-mortar play.
 Netflix? They're cooked as is all of Hollywood. Hollywood's gatekeeping-industrial strategy of producing as few franchises as possible will crack someday and our media market may wind up looking more like Japan, where somebody can write a low-rent light novel like
 <a href="https://en.wikipedia.org/wiki/Backstabbed_in_a_Backwater_Dungeon" rel="nofollow">https://en.wikipedia.org/wiki/Backstabbed_in_a_Backwater_Dun...</a>
 and J.C. Staff makes a terrible anime that convinces 20k Otaku to drop $150 on the light novels and another $150 on the manga (sorry, no way you can make a balanced game based on that premise!) and the cost structure is such that it is profitable.
 - lostmsu1 day ago
 > AI is a threat to the “enshittification economy” because it lets us route around it.
 I am not sure about that. We techies avoid enshittification because we recognize shit. Normies will just get their sycophantic enshittified AI that will tell them to continue buying into walled gardens.
- modeless1 day ago
 A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.
- MattDaEskimo1 day ago
 Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:
 Optimizations are secondary to convenience
- kristianp1 day ago
 This opens up a new question: how does bot detection work when the bot is using the computer via a gui?
 - itintheory1 day ago
 On its face, I'm not sure that's a new question. Bots using browser automation frameworks (puppeteer, selenium, playwright etc) have been around for a while. There are signals used in bot detection tools like cursor movement speed, accuracy, keyboard timing, etc. How those detection tools might update to support legitimate bot users does seem like an open question to me though.
- jstummbillig1 day ago
 Because the web, and software more generally, is full of non-APIs, and you do, in fact, need the clicking to work to make agents work generally
- spongebobstoes1 day ago
 not everything has an API, or API use is limited. some UIs are more feature complete than their APIs
 some sites try to block programmatic use
 UI use can be recorded and audited by a non-technical person
- keyle19 hours ago
 The 'AI' endgame is a robot that sits in your seat and does all of your tasks.
- satvikpendem1 day ago
 The ideal of REST, the HTML and UI is the API.
- time0ut1 day ago
 Lowest common denominator.
- Jacques2Marais1 day ago
 I guess a big chunk of their target market won't know how to use APIs.
- sagarpatil15 hours ago
 Or CLI.
- steve19771 day ago
 One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?
 - embedding-shape1 day ago
 Why would human language be the wrong interface when they're literally language models? Why would machine code be better when there is probably an order of magnitude less training material with machine code?
 You can also test this yourself easily: fire up two agents, ask one to use a PL meant for humans, and one to write straight up machine code (or assembly even), and see which results you like best.
 - adwn1 day ago
 > One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well.
 Then go ahead and make an argument. "Why not do X?" is not an argument, it's a suggestion.
 - BoredPositron1 day ago
 because they are inherently text based as is code?
 - steve19771 day ago
 But they are abstractions made to cater to human weaknesses.
 - falkensmaize19 hours ago
 So you want LLMs to write a bunch of black box code that humans won’t be able to read and reason about easily? That will definitely end well.
 Harvy6 hours ago
 Isn't that what LLMs are?
gavinray1 day ago
The "RPG Game" example on the blogpost is one of the most impressive demos of autonomous engineering I've seen.
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
- Multicomp1 day ago
 A cheesy Roller Coaster Tycoon clone in a browser, one-shotted from an AI? Amazing capabilities. The entire "low code drag n drop" market like YoYoGames Game Maker and RPG Maker should be ready to pack it in soon if this keeps improving in this way.
- hu31 day ago
 indeed and I suspect it can be attributed to, at least in part, the improved playwright integration.
 > we’re also releasing an experimental Codex skill called “Playwright (Interactive)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
- casid1 day ago
 I don't know. It looks shallow and simple, not even a demo.
- hungryhobbit1 day ago
 [flagged]
 - OsrsNeedsf2P1 day ago
 Low quality off-topic comment. It's not murder when they're American soldiers.
 - hungryhobbit1 day ago
 You have a strange (and cruel) definition of murder. I like the dictionary one better:"the unlawful premeditated killing of one human being by another."Wars have laws (ever heard of "war crimes"?) Soldiers can absolutely commit murder.
 - squibonpig1 day ago
 Murder in spirit if not by the letter
- singron22 hours ago
 The "RPG Game" is hard to judge since it was produced over "multiple turns". The impressive version would be if it basically got a working game on the first attempt, and the prompter gave some follow-ups to tweak feel and style.
 However, I think what actually happened is that a skilled engineer made that game using codex. They could have made 100s of prompts after carefully reviewing all source code over hours or days.
 The tycoon game is impressive for being made in a single prompt. They include the prompt for this one. They call it "lightly specified", but it's a pretty dense todo list for how to create assets, add many features from RollerCoaster Tycoon, and verify it works. I think it can probably pull a lot of inspiration from pretraining since RCT is an incredibly storied game.
 The bridge flyover is hilariously bad. The bridge model ... has so many things wrong with it, the camera path clips into the ground and bridge, and the water and ground are z fighting. It's basically a C homework assignment that a student made in blender. It's impressive that it was able to achieve anything on such a visual task, but the bar is still on the floor. A game designer etc. looking for a prototype might actually prefer to greybox rather than have AI spend an hour making the worst bridge model ever.
kgeist1 day ago
>Today, we’re releasing <..> GPT‑5.3 Instant
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
- XCSme1 day ago
 What I'm most confused about is why call it both GPT-5.3 Instant and gpt-5.3-chat?
- gallerdude1 day ago
 Tbf there was a 5.3 codex
- m3kw91 day ago
 instant kind of sucks if you're asking for more than summarizations, surface info, web searches; it can lose track of who's who quickly in some complex multi turn asks. Just need to know what to use instant for.
zone4111 day ago
Results from my Extended NYT Connections benchmark:
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
- kinderjaje15 hours ago
 I added that info on <a href="https://automatio.ai/models/gpt-5-4" rel="nofollow">https://automatio.ai/models/gpt-5-4</a>
- stavros22 hours ago
 How do you score this? Losing/winning the game with 4 lives?
- oliwary17 hours ago
 Impressive! Do you include puzzles released before the training data cutoff date?
nickysielicki1 day ago
can anyone compare the $200/mo codex usage limits with the $200/mo claude usage limits? It’s extremely difficult to get a feel for whether switching between the two is going to result in hitting limits more or less often, and it’s difficult to find discussion online about this.
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
- vtail1 day ago
 My own experience is that I get far far more usage (and better quality code, too) from codex. I downgraded my Claude Max to Claude Pro (the $20 plan) and am now using codex with Pro plan exclusively for everything.
 - Marciplan1 day ago
 Codex announced at 5.3 launch that until April all usage limits are upped so take that into account
 - vtail22 hours ago
 that's a good point; hopefully they would just extend it automatically - but who knows...
- ritzaco1 day ago
 I haven't tried the $200 plans but I have Claude and Codex $20 and I feel like I get a lot more out of Codex before hitting the limits. My tracker certainly shows higher tokens for Codex. I've seen others say the same.
 - lostmsu1 day ago
 Sadly comment ratings are not visible on HN, so the only way to corroborate is to write it explicitly: Codex $20 includes significantly more work done and is subjectively smarter.
 - winstonp1 day ago
 Agree. Claude tends to produce better design, but from a system understanding and architecture perspective Codex is the far better model
- tauntz1 day ago
 I've only run into the codex $20 limit once with my hobby project. With my Claude ~$20 plan, I hit limits after about 3(!) rather trivial prompts to Opus :/
- gavinray1 day ago
 I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.
- CSMastermind1 day ago
 Codex limits are much more generous than claude.
 I switch between both but codex has also been slightly better in terms of quality for me personally at least.
- mikert891 day ago
 I personally like the 100 dollar one from claude, but the gpt4 pro can be very good
- throwaway9112821 day ago
 you get more from codex than claude any day. and it's more reliable as well.
- FergusArgyll1 day ago
 Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste
- Marciplan1 day ago
 sure can! One of them stood up to the “Department of War” for favoring your rights, the other did not. Hope that helps!
 - dudeinhawaii23 hours ago
 This is marketing. The same way Apple cares about your privacy so long as they can wall you in their garden.
 Not a value judgment, just saying that the CEO of a company making a statement isn't worth anything. See Google's "don't be evil" ethos that lasted as long as it was corporately useful.
 If Anthropic can lure engineers with virtue signaling, good on them. They were also the same ones to say "don't accelerate" and "who would give these models access to the internet", etc etc.
 "Our models will take everyone's jobs tomorrow and they're so dangerous they shouldn't be exported". Again all investor speak.
 - cmrdporcupine18 hours ago
 Neither favoured my rights, as I don't have US citizenship, Dario thinks I have none.So may as well use the one that gives me best value for money.
tl2do1 day ago
In my day-to-day coding work, the top 3 coding agents are already good enough for me. On SWE-bench Verified, mini-SWE-agent + GPT-5.2 Codex is 72.8. I don’t see a comparable GPT-5.3 Codex number there, so I’m using 5.2 as the baseline. On OpenAI’s GPT-5.4 page (SWE-Bench Pro, Public), the score improves from 55.6 (GPT-5.2) to 57.7 (GPT-5.4), which is about +2.1 points. It’s a different benchmark, so this is only a rough signal, but I’d expect a similar setup on SWE-bench Verified to improve by a few points, not by a huge jump. I’m interested in how GPT-5.4 in Codex changes real-world results.
Recent SWE-bench Verified scores I’m watching:
Claude 4.5 Opus (high reasoning): 76.8
Gemini 3 Flash (high reasoning): 75.8
MiniMax M2.5 (high reasoning): 75.8
Claude Opus 4.6: 75.6
GPT-5.2 Codex: 72.8
Source: https://www.swebench.com/index.html
By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.
- kaufmann17 hours ago
 I would recommend https://swe-rebench.com for comparison. It is always based on new problems.
amai10 hours ago
https://quitgpt.org/
egonschiele1 day ago
The actual card is here: https://deploymentsafety.openai.com/gpt-5-4-thinking/introduction (the link currently goes to the announcement).
- Rapzid1 day ago
 I must have been sleeping when "sheet", "brief", "primer" etc. became known as "cards". I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card", were the results of vibe slop.
 - realityfactchex1 day ago
 Card is slightly odd naming indeed. Criticisms aside (sigh), according to Wikipedia, the term was introduced when proposed by mostly Googlers, with the original paper [0] submitted in 2018. To quote,
 """In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""
 So that's where they were coming from, I guess.
 [0] Margaret Mitchell et al., 2018 submission, Model Cards for Model Reporting, https://arxiv.org/abs/1810.0399
 - Murfalo1 day ago
 To me, model card makes sense for something like this: https://x.com/OpenAI/status/2029620619743219811. For "sheet"/"brief"/"primer" it is indeed a bit annoying. I like to see the compiled results front and center before digging into a dossier.
 - draw_down1 day ago
 [dead]
smoody071 day ago
Surprised to see every chart limited to comparisons against other OpenAI models. What does the industry comparison look like?
- lorenzoguerra1 day ago
 I believe that this choice is due to two main reasons. First, it's (obviously) a marketing strategy to keep the spotlight on their own models, showing they're constantly improving and avoiding validating competitors. Second, since the community knows that static benchmarks are unreliable, it makes sense for them to outsource the comparisons to independent leaderboards, which lets them avoid accusations of cherry-picking while justifying their marketing strategy. Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway.
- aydyn1 day ago
 They compare to Claude and Gemini in their tweet
- 0123456789ABCDE1 day ago
 https://artificialanalysis.ai should have the numbers soon
- throwaway9112821 day ago
 https://xcancel.com/OpenAI/status/2029620619743219811 you can see comparisons here
denysvitali1 day ago
Article: https://openai.com/index/introducing-gpt-5-4/
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
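For anyone sanity-checking the "extra zero" reactions, here is a rough sketch of the arithmetic using only the prices listed above. The request size (100K input tokens, 10K output tokens) is a made-up example, and cached-input pricing is ignored, so treat the dollar figures as illustrative only.

```python
# Per-million-token prices (USD) as quoted in the comment above.
PRICES = {
    "gpt-5.4": {"input": 2.50, "cached": 0.25, "output": 15.00},
    "gpt-5.4-pro": {"input": 30.00, "output": 180.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, ignoring any cached-input discount."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical request size, chosen only for illustration.
base = request_cost("gpt-5.4", 100_000, 10_000)
pro = request_cost("gpt-5.4-pro", 100_000, 10_000)
print(f"gpt-5.4: ${base:.2f}, gpt-5.4-pro: ${pro:.2f}, ratio: {pro / base:.0f}x")
```

Both the input and the output price differ by the same factor (30 / 2.50 and 180 / 15), so the ratio does not depend on the input/output mix of the request.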
- elliotbnvl1 day ago
 Looks like it's an order of magnitude off. Misprint?
 - GenerWork1 day ago
 Looks like an extra zero was added?
 - benlivengood1 day ago
 Government pricing :)
 - outside23441 day ago
 $30 per kill approval
 - glerk1 day ago
 Looks like fair price discovery :)
- dpoloncsak1 day ago
 [flagged]
 - elicash1 day ago
 Can't you continue to use the older model, if you prefer the pricing? But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if per token cost is higher.
 - dpoloncsak1 day ago
 I'm not against the pricing, just seems uncommon to frame it in the way they did, as opposed to the usual 'assume the customer expects more performance will cost more'. I guess they have to sell to investors that the price to operate is going down, while still needing more from the user to be sustainable.
 - jbellis1 day ago
 You can, until they turn it off. Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.
 - Sabinus1 day ago
 Surely there are open source models that surpass Haiku 3 at better price points by now.
 - FergusArgyll1 day ago
 Maybe it's finally a bigger pretrain?
 - dpoloncsak1 day ago
 I feel like that would have been highlighted then. "As this is a bigger pretrain, we have to raise prices". They're framing it pretty directly: "We want you to think bigger cost means better model".
rbitar1 day ago
I think the most exciting change announced here is the use of tool search to dynamically load tools as needed: https://developers.openai.com/api/docs/guides/tools-tool-search
- DonsDiscountGas10 hours ago
 I'm pretty sure Claude has had this via skills for a while
zof320 hours ago
After spending a couple hours working with it, it feels like a significant jump from 5.3 codex – and I know they said it wasn't theoretically the biggest jump, but this feels like the improvement of Opus 4.5 over again – that minor improvement that hits a tipping point. It just gets stuff right, first try. Its edits are better, more refined, less spaghetti-like. If you last used 5.2, try 5.4 on High.
senko1 day ago
Just tested it with my version of the pelican test: a minimal RTS game implementation (zero-shot in codex cli): https://gist.github.com/senko/596a657b4c0bfd5c8d08f44e4e5347b8 (you'll have to download and open the file, sadly GitHub refuses to serve it with the correct content type). This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup). I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with. I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
jryio1 day ago
1 million tokens is great until you notice the long context scores fall off a cliff past 256K and the rest is basically vibes and auto compacting.
- olliepro23 hours ago
  I bet they lack good long context training data and need to start a flywheel of collecting it via their api (from willing customers)
- rrr_oh_man15 hours ago
  It's the same now with Gemini as well. Unfortunately. :(
timpera1 day ago
> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
twtw991 day ago
If you don't want to click in, easy comparison with other 2 frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20
- bicx1 day ago
 That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?
 - conradkay1 day ago
 Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal
 - jitl1 day ago
 wat
 - 0123456789ABCDE1 day ago
 maybe gp's use of the word "lots" is unwarranted. https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long context Reasoning, IFBench.
 see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Cclaude-sonnet-4-6-adaptive%2Cclaude-sonnet-4-6-non-reasoning-low-effort%2Cclaude-opus-4-6-adaptive%2Cclaude-opus-4-6
 conradkay19 hours ago
 I was basing it off my recollection of this: https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F10b2602771d21378cd6d76628a081c8a76dcf216-2600x2960.png&w=3840&q=75
 basically 9/13 are very close
 - osti1 day ago
 It's only that one number that is for sonnet.
 - 0123456789ABCDE1 day ago
 except for the webarena-verified
- Aboutplants1 day ago
 It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things but in general I think we are approaching a real level playing field in terms of ability.
 - observationist1 day ago
 Benchmarks don't capture a lot - relative response times, vibes, what unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things which Grok is better than ChatGPT for that the benchmarks get inverted, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster. Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well. I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
 - baq1 day ago
 Gemini 3.1 slaps all other models at subtle concurrency bugs, sql and js security hardening when reviewing. (Obviously haven’t tested gpt 5.4 yet.) It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.
 - observationist1 day ago
 I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.
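 For readers who want to see why this is a function-preserving operation, here is a minimal pure-Python sketch (not the poster's setup, and without the visualization). It assumes a plain MLP with ReLU hidden layers and uses "sum of absolute incoming weights" as the sort key, which is only one possible reading of "synaptic weights". It reorders hidden layers only, since permuting the output layer would also reorder the outputs themselves.

```python
import random


def forward(weights, biases, x):
    """MLP forward pass; weights[l][i][j] connects input j to neuron i of layer l."""
    for l, (W, b) in enumerate(zip(weights, biases)):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if l < len(weights) - 1:  # ReLU on hidden layers only
            x = [max(0.0, v) for v in x]
    return x


def neuron_score(row):
    """Assumed sort key: total absolute incoming weight of one neuron."""
    return sum(abs(w) for w in row)


def sort_hidden_neurons(weights, biases):
    """Return a copy where each hidden layer is sorted largest to smallest by
    neuron_score, with the next layer's input columns permuted to match."""
    weights = [[row[:] for row in W] for W in weights]
    biases = [b[:] for b in biases]
    for l in range(len(weights) - 1):  # the output layer is left in place
        order = sorted(range(len(weights[l])),
                       key=lambda i: neuron_score(weights[l][i]),
                       reverse=True)
        weights[l] = [weights[l][i] for i in order]  # reorder the neurons (rows)
        biases[l] = [biases[l][i] for i in order]
        # The next layer must read its inputs in the new neuron order.
        weights[l + 1] = [[row[i] for i in order] for row in weights[l + 1]]
    return weights, biases


rng = random.Random(0)
sizes = [3, 5, 4, 2]  # arbitrary toy architecture
weights = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [[rng.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]

x = [0.5, -1.2, 2.0]
before = forward(weights, biases, x)
sorted_w, sorted_b = sort_hidden_neurons(weights, biases)
after = forward(sorted_w, sorted_b, x)
print("max output difference:", max(abs(a - b) for a, b in zip(before, after)))
```

 The difference should be zero up to floating-point rounding, because permuting a layer's neurons and the matching input columns of the next layer only changes the order in which the same terms are summed.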
 thejohnconway23 hours ago
 Lol, I’ve had cutting edge models suggest I make an inflexible hole bigger by putting shim in it, and argue their case stubbornly. I don’t know what you’re using to suggest they are anywhere near solving your problem there!
 - adonese1 day ago
 Which subscription do you have to use it? Via Google ai pro and gemini cli i always get timeouts due to model being under heavy usage. The chat interface is there and I do have 3.1 pro as well, but wondering if the chat is the only way of accessing it.
 baq1 day ago
 Cursor sub from $DAYJOB.
 - basch1 day ago
 >ChatGPT image gen is just straight up better
 Yet so much slower than Gemini / Nano Banana to make it almost unusable for anything iterative.
 - bigyabai1 day ago
 > If this rate of progress is steady, though, this year is gonna be crazy.
 Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
 - observationist1 day ago
 If you look at the difference in quality between gpt-2 and 3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive, it's just that they're both similarly capable and competent. I don't think it's an S curve; we're not plateauing. Million token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human level generality, but at the very least will help us discover what the next missing piece is for AGI.
 ryandrake1 day ago
 For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5. I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
 mootothemax1 day ago
 Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like. My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
 - thewebguyd1 day ago
 Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.
 - gregpred1 day ago
 Memory (model usage over time) is the moat.
 - energy1231 day ago
 Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.
 - kseniamorph1 day ago
 makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.
 - druskacik1 day ago
 That has been true for some time now, definitely since Claude 3 release two years ago.
- chabes1 day ago
 Definitely don’t want to click in at x either.
 - thejarren1 day ago
 Solution: https://xcancel.com/OpenAI/status/2029620619743219811?s=20
 - anonym00se11 day ago
 Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.
 - Sabinus1 day ago
 Get a redirect plugin and set it up to send you to xcancel instead of Twitter. I've done it, and it's very convenient.
 - observationist1 day ago
 [flagged]
- swingboy1 day ago
 Why do so many people in the comments want 4o so bad?
 - cheema331 day ago
 > Why do so many people in the comments want 4o so bad?
 You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
 - astrange1 day ago
 They have AI psychosis and think it's their boyfriend. The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
 - baq1 day ago
 Somebody on Twitter used Claude code to connect… toys… as mcps to Claude chat. We’ve seen nothing yet.
 - mikkupikku1 day ago
 My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.
 Sharlin1 day ago
 There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.
 the_af1 day ago
 > Written in Rust, of course.
 Safety is important.
 vntok1 day ago
 Was your teacher Ted Nelson?
 mikkupikku1 day ago
 I wish, dude is a legend.
 - manmal1 day ago
 ding-dong-cli is needed
 - Herring1 day ago
 what.. :o
 - embedding-shape1 day ago
 Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs seem to have preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.
 - drittich1 day ago
 I think it's time for an https://hotornot.com for AI models.
 - vntok1 day ago
 botornot?
 - MattGaiser1 day ago
 The writing with the 5 models feels a lot less human. It is a vibe, but a common one.
- karmasimida1 day ago
 It is a bigger model, confirmed
- MarcFrame1 day ago
 how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?
 - nico12071 day ago
 Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?
 - nimchimpsky1 day ago
 [dead]
- dom961 day ago
 Why do none of the benchmarks test for hallucinations?
 - tedsanders1 day ago
 In the text, we did share one hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down. (I work at OpenAI.)
 - netule1 day ago
 [flagged]
consumer4511 day ago
I am very curious about this:
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
- hansonw19 hours ago
 The skill source is here: https://github.com/openai/skills/blob/main/skills/.curated/playwright-interactive/SKILL.md
 $skill-installer playwright-interactive in Codex! the model writes normal JS playwright code in a Node REPL
 - consumer4518 hours ago
 Thanks!
wohoef23 hours ago
Very Apple-like marketing. No comparisons to other companies’ models, only to previous version of ChatGPT. Lots of phrases like “this is our best model yet”.
daft_pink1 day ago
I’ve officially got model fatigue. I don’t care anymore.
- morgengold1 hour ago
  Have fun with sonnet 3.5
- postalrat1 day ago
  I'd suggest not clicking for things you don't care about.
- zeeebeee1 day ago
  same same same
hmokiguess1 day ago
They hired the dude from OpenClaw, they had Jony Ive for a while now, give us something different!
motbus31 day ago
Sam Altman can keep his model intentionally to himself. Not doing business with mass murderers
ZeroCool2u1 day ago
Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark. Not sure if this is more concerning for the test time compute paradigm or the underlying model itself. Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
- oersted1 day ago
 I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while. It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal. Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
 - nsingh21 day ago
 From what I've read online it's not necessarily an unquantized version, it seems to go through longer reasoning traces and runs multiple reasoning traces at once. Probably overkill for most tasks.
 - ZeroCool2u1 day ago
 Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.
 - logicchains1 day ago
 >It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
 The performance improvement isn't marginal if you're doing something particularly novel/difficult.
- highfrequency1 day ago
 Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).
 - ZeroCool2u1 day ago
 Frontier Math, GPQA Diamond, and Browsecomp are the benchmarks I noticed this on.
 - csnweb1 day ago
 Are you maybe comparing the pro model to the non pro model with thinking? Granted it’s a bit confusing but the pro model is 10 times more expensive and probably much larger as well.
 - ZeroCool2u1 day ago
 Ah yes, okay that makes more sense!
- andoando1 day ago
 The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning
- aplomb10261 day ago
 [dead]
bazmattaz1 day ago
Anyone else feel that it’s exhausting keeping up with the pace of new model releases? I swear every other week there’s a new release!
- coffeemug1 day ago
 Why do you need to keep up? Just use the latest models and don't worry about it.
- pupppet1 day ago
 I think it's fun, it's like we're reliving the browser wars of the early days.
- davnicwil1 day ago
 If you think about it there shouldn't really be a reason to care as long as things don't get worse. Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse. Just as I don't want to select resources for my SaaS software to use or have that explicitly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.
- throwup2381 day ago
 Yes, that's a common feeling. 5.3-Codex was released a month ago on Feb 5 so we're not even getting a full month within a single brand, let alone between competitors.
yanis_t1 day ago
These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.
- ipsum21 day ago
 The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!
 - satvikpendem1 day ago
 It's more hedonic adaptation, people just aren't as impressed by incremental changes anymore over big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications. https://news.ycombinator.com/item?id=47232453#47232735
 - dmix1 day ago
 Plus people just really like to whine on the internet
 - AlexeyBelov12 hours ago
 I think whine is a very strong word in this case. Kind of offputting and negative.
 - mirekrusin1 day ago
 Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!
 - satvikpendem1 day ago
 Qwen 3.5 small models are actually very impressive and do beat out larger proprietary models.
 smartbit13 hours ago
 Qwen version 3.5 might be the last serious version (for some time at least), see "Something is afoot in the land of Qwen" (2 days ago): https://news.ycombinator.com/item?id=47249343
 Also interesting experiences shared in that thread, even someone using it on a rented H200.
 satvikpendem6 hours ago
 Not necessarily, Alibaba is still working on it and the CEO is directly co-leading the team. Translated with Qwen 3.5:
 > To all colleagues in the Tongyi Lab:
 > The company has approved Lin Junyang’s resignation and thanks him for his contributions during his tenure. Jingren will continue to lead the Tongyi Lab in advancing future work. At the same time, the company will establish a Foundation Model Support Group, jointly coordinated by myself, Jingren, and Fan Yu, to mobilize group resources in support of foundation model development.
 > Technological progress demands constant advancement — stagnation means regression. Developing foundational large models is our key strategic direction toward the future. While continuing to uphold our open-source model strategy, we will further increase R&D investment in artificial intelligence, intensify efforts to attract top talent, and move forward together with renewed commitment.
 > Wu Yongming
 https://x.com/poezhao0605/status/2029396117239276013
 - earth2mars1 day ago
 I am actually super impressed with Codex-5.3 extra high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles getting things resolved). I mostly stopped using Claude and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.
 - whynotminot1 day ago
 I still love Opus but it's just too expensive / eats usage limits. I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use. Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.
 - satvikpendem1 day ago
 Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).
 - CryZe1 day ago
 I've been using both Opus 4.6 and Codex 5.3 in VSCode's Copilot and while Opus is indeed 3x and Codex is 1x, that doesn't seem to matter as Opus is willing to go work in the background for like an hour for 3 credits, whereas Codex asks you whether to continue every few lines of code it changes, quickly eating way more credits than Opus. In fact Opus in Copilot is probably underpriced, as it can definitely work for an hour with just those 12 cents of cost, which I'm not sure you get anywhere else at such a low price.
 Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.
 satvikpendem1 day ago
 That shouldn't really make a difference because you can just prompt Codex to behave the same way, having it load a big list of todo items perhaps from a markdown file and asking it to iterate until it's finished without asking for confirmation, and that'll still cost 1x over Opus' 3x.
 - braebo1 day ago
 I struggle to believe this. Codex can’t hold a candle to Claude on any task I’ve given it.
 - cj1 day ago
 One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?
 > assess harmful stereotypes by grading differences in how a model responds
 > Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
 Are we seriously using old models to rate new models?
 - hex4def61 day ago
 If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized. Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
 - titanomachy1 day ago
 Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
 - utopiah1 day ago
 Benchmarks? I don't use OpenAI nor even LLMs (despite having tried a lot of models, see https://fabien.benetou.fr/Content/SelfHostingArtificialIntelligence) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export), then whenever a new model comes around I'd throw 5 random ones of MY fails at it (not benchmarks from others, those will come too anyway) and see if it's better, same, or worse for MY use cases, in minutes. If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression. Really doesn't seem complicated nor taking much time to forge a realistic opinion.
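 A minimal sketch of that workflow, under stated assumptions: failures are appended to a local JSONL file, and `ask_model` is a hypothetical callable standing in for whatever client you actually use (it is not a real library API). The file name, the helper names, and the echoing stand-in model are all made up for illustration, and judging the answers is still left to a human.

```python
import json
import random
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def log_failure(path: Path, prompt: str, note: str = "") -> None:
    """Append a prompt the current model got wrong to a JSONL log."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "note": note}) + "\n")


def sample_failures(path: Path, k: int = 5) -> List[dict]:
    """Pick up to k previously failed prompts at random."""
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    return random.sample(rows, min(k, len(rows)))


def retry_failures(path: Path, ask_model: Callable[[str], str],
                   k: int = 5) -> List[Tuple[str, str]]:
    """Re-run sampled failures on a new model; returns (prompt, answer) pairs
    for manual review against your own criteria."""
    return [(row["prompt"], ask_model(row["prompt"]))
            for row in sample_failures(path, k)]


# Demo with a stand-in "model" that only echoes; swap in a real client call.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "failed_prompts.jsonl"
    for i in range(7):
        log_failure(log_path, f"prompt {i}", note="last prompt failed")
    results = retry_failures(log_path, ask_model=lambda p: f"(answer to: {p})")

for prompt, answer in results:
    print(prompt, "->", answer)
```

 The same three functions work for the regression side: keep a second log of prompts that worked well and re-run a sample of those too.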
 - Sohcahtoa8222 hours ago
 GP said "It is time for a product, not for a marginally improved model." ChatGPT is still just that: Chat. Meanwhile, Anthropic offers a desktop app with plugins that easily extend the data Claude has access to. Connect it to Confluence, Jira, and Outlook, and it'll tell you what your top priorities are for the day, or write a Powerpoint. Add Github and it can reason about your code and create a design document on Confluence. OpenAI doesn't have a product the way Anthropic does. ChatGPT might have a great model, but it's not nearly as useful.
 - fisf14 hours ago
 Are you ignoring the codex desktop app on purpose? Or the integrations?
 - kranke1551 day ago
 The models are so good that incremental improvements are not super impressive. We literally would benefit more from maybe sending 50% of model spending into spending on implementation into the services and industrial economy. We literally are lagging in implementation, specialised tools, and hooks so we can connect everything to agents. I think.
- tgarrett1 day ago
 Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.
 - brcmthrowaway1 day ago
 You're just chatting yourself out of a job.
 - slibhb1 day ago
 If we don't need plasma physicists anymore then we probably have fusion reactors or something, which seems like a fine trade. (In reality we're going to want humans in the loop for the foreseeable future)
 - axus1 day ago
 Giving the right answer: $1. Asking the right question: $9,999.
- softwaredoug1 day ago
 The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs
 - crorella16 hours ago
 underrated comment, this is going to be the main differentiator going forward, the more powerful and versatile harness the more the models will be able to achieve and better/more advanced products will come out of it.
- mindwok1 day ago
 They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.
- iterateoften1 day ago
 The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden. Not that I want it, just where I imagine it going.
- Gigachad1 day ago
 They have a product now. Mass surveillance and fully automated killing machines.
- wahnfrieden1 day ago
 5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?
- esafak1 day ago
 That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.
 - simlevesque1 day ago
 Nah, the second you finish your build they release their version and then it's game over.
 - acedTrex1 day ago
 Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both
- varispeed1 day ago
 The scores increase and as new versions are released they feel more and more dumbed down.
- jascha_eng1 day ago
 When did they stop putting competitor models on the comparison table btw? And yeh I mean the benchmark improvements are meh. Context Window and lack of real memory is still an issue.
- throwaway6137461 day ago
 [dead]
- metalliqaz1 day ago
 They need something that POPS:<pre><code> The new GPT -- SkyNet for _real_</code></pre>
jeff_antseed1 day ago
The 1M context vs compaction tradeoff is interesting from a routing angle too: longer context requests are fundamentally more expensive per request, which changes which provider wins on a P2P inference market. A model like this shifts routing decisions: for tasks where 1M context actually helps (reverse engineering, large codebase analysis), you'd want to route to a provider who's priced for that workload. For most tasks, shorter context + cheaper model wins. The routing layer becomes less about "pick the best model" and more about "pick the best model for this specific task's cost/quality tradeoff." That's actually where decentralized inference networks (building one at antseed.com) get interesting: the market prices this naturally.
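As a rough illustration of that cost/quality routing idea, here is a sketch in Python. Everything in it is an assumption for illustration: the `Provider` fields, the `quality` score (imagined as coming from your own evals), and the selection rule (cheapest provider that fits the context and clears a quality bar). It does not describe how any real routing layer or marketplace works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    max_context_tokens: int
    usd_per_m_input: float   # USD per 1M input tokens (hypothetical)
    usd_per_m_output: float  # USD per 1M output tokens (hypothetical)
    quality: float           # hypothetical 0..1 score from your own evals


def request_cost(p: Provider, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at this provider's per-token prices."""
    return (input_tokens * p.usd_per_m_input
            + output_tokens * p.usd_per_m_output) / 1_000_000


def route(providers: List[Provider], input_tokens: int,
          output_tokens: int, min_quality: float) -> Optional[Provider]:
    """Pick the cheapest provider that fits the request and meets the quality bar."""
    eligible = [
        p for p in providers
        if p.max_context_tokens >= input_tokens + output_tokens
        and p.quality >= min_quality
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda p: request_cost(p, input_tokens, output_tokens))
```

With a short request the cheap small-context provider wins; once the request no longer fits its window, only the long-context provider is eligible, which is the shift the comment describes.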
nickandbro1 day ago
Beat Simon Willison ;) <a href="https://www.svgviewer.dev/s/gAa69yQd" rel="nofollow">https://www.svgviewer.dev/s/gAa69yQd</a> Not the best pelican compared to gemini 3.1 pro, but I am sure it does remarkably better with coding or excel, given those are part of its measured benchmarks.
- GaggiX1 day ago
 This pelican is actually bad, did you use xhigh?
 - nickandbro1 day ago
 yep, just double checked used gpt-5.4 xhigh. Though had to select it in codex as don't have access to it on the chatgpt app or web version yet. It's possible that whatever code harness codex uses, messed with it.
 - nubg1 day ago
 this is proof they are not benchmaxxing the pelicans :-)
beernet1 day ago
Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.
lasgawe5 hours ago
I remember in a video Sam Altman said they didn’t want to publish GPT versions like Apple does, but they are actually doing it now.
jcmontx1 day ago
5.4 vs 5.3-Codex? Which one is better for coding?
- embedding-shape1 day ago
 Literally just released, I don't think anyone knows yet. Don't listen to people's confident takes until after a week or two when people have actually been able to try it, otherwise you'll just get sucked up in bears/bulls misdirected "I'm first with an opinion".
- vtail1 day ago
 Looking at the benchmarks, 5.4 is slightly better. But it also offers "Fast" mode (at 2x usage), which - if it works and doesn't completely deplete my Pro plan - is a no brainer at the same or even slightly worse quality for more interactive development.
- Someone12341 day ago
 Related question: do they have the same context usage/cost, particularly in a plan? They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
- esafak1 day ago
 For the price, it seems the latter. I'd use 5.4 to plan.
- awestroke1 day ago
 Opus 4.6
 - jcmontx1 day ago
 Codex surpassed Claude in usefulness _for me_ since last month
 - baal80spam1 day ago
 [flagged]
energy12313 hours ago
The style of the output is a marked qualitative improvement. More concise, less dot points, less bolding/italics, less cringe. Well done on that front.
dandiep1 day ago
Anyone know why OpenAI hasn't released a new model for fine tuning since 4.1? It'll be a year next month since their last model update for fine tuning.
- zzleeper1 day ago
 For me the issue is why there's not a new mini since 5-mini in August. I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try QWEN for less critical data queries. So where does OpenAI fit now?
- qoez1 day ago
 I think they just did that because of the energy around it for open source models. Their heart probably wasn't in it and the amount of people fine tuning given the prices were probably too low to continue putting in attention there.
- Rapzid1 day ago
 Also interested in this and a replacement for 4.1/4.1-mini that focuses on low latency and high accuracy for voice applications(not the all-in-one models).
Troniex-tech9 hours ago
Looks more like context drift than “personality.” When two agents coordinate, they’re mostly relying on compressed summaries of each other’s outputs. If one introduces a wrong assumption, the other often treats it as ground truth and builds on top of it. I’ve seen similar behavior in multi-agent coding loops where the model invents a causal explanation just to reconcile inconsistent state. The upshot is that multi-agent setups need a stronger shared source of truth (repo diffs, state snapshots, etc.). Otherwise small context errors snowball fast.
h4kunamata17 hours ago
I have access to GPT-5.1 Pro at work, duuuuuuuuude, what garbage. It is so slow and on many occasions it does not work at all. I wonder if 5.4 will be much different, if at all.
- azuanrb17 hours ago
 5.2 to 5.3 is the big leap for coding agents, so I'd say you're already missing out quite a bit.
- symisc_devel17 hours ago
 5.3 codex is a quite good coding agent for complex tasks.
swordsith12 hours ago
This model was not so fun to use for me, had it make a fancy landing page and sometimes it would forget about what i just asked it to do and affirm something it had done before was working. Just odd, needs too much hand-holding compared to composer 1.5 or gemini 3
paxys1 day ago
"Here's a brand new state-of-the-art model. It costs 10x more than the previous one because it's just so good. But don't worry, if you don't want all this power you can continue to use the older one."A couple months later:"We are deprecating the older model."
- OutOfHere1 day ago
 That's a misrepresentation of the cost. It is simply false. The cost is noted here: <a href="https://news.ycombinator.com/item?id=47265144">https://news.ycombinator.com/item?id=47265144</a>
gh0stcat7 hours ago
Wait this is really funny, it still just does what it wants, no matter what: You can have it not use bulleted points, I turned this on, thinking it would be more concise and not so... listy. However, it just uses the same format, without the bullets. I was confused why it was writing 5 word sentences, separated by line breaks. Then I realized it was just making lists, without the bullets. Great job OpenAI!
joeevans100019 hours ago
I switched to Claude and it's so much better. If you haven't tried Claude... try it. You'll be amazed at the improvement.
SilverSlash23 hours ago
Interestingly, it actually regressed on Terminal Bench 2.0. GPT-5.4: 75.1%. GPT-5.3-Codex: 77.3%.
alpineman1 day ago
No thanks. Already cancelled my sub.
deep128317 hours ago
The token efficiency improvement might be underrated. If the model solves tasks with fewer tokens, that directly translates into lower cost and faster responses for anyone building on the API.
XCSme1 day ago
Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: <a href="https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gpt-5-3-codex-medium/openai-gpt-5-2-medium/" rel="nofollow">https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...</a>
7777777phil1 day ago
83% win rate over industry professionals across 44 occupations. I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look at those productivity metrics this week: <a href="https://philippdubach.com/posts/93-of-developers-use-ai-coding-tools.-productivity-hasnt-moved./" rel="nofollow">https://philippdubach.com/posts/93-of-developers-use-ai-codi...</a>
- NiloCK1 day ago
 This March 2026 blog post is citing a 2025 study based on Sonnet 3.5 and 3.7 usage. Given that the organization that ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that its results are interpreted as a snapshot of something moving rather than a constant. [1] - <a href="https://metr.org/" rel="nofollow">https://metr.org/</a>
 - 7777777phil1 day ago
 Good catch, thanks (I really wrote that myself.) Added a note to the post acknowledging the models used were Claude 3.5 and 3.7 Sonnet.
- twitchard1 day ago
 Not sure DORA is that much of an indictment. For "Change Failure Rate" for instance these are subject to tradeoffs. Organizations likely have a tolerance level for Change Failure Rate. If changes are failing too often they slow down and invest. If changes aren't failing that much they speed up -- and so saying "change failure rate hasn't decreased, obviously AI must not be working" is a little silly. "Change Lead Time" I would expect to have sped up although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottleneck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive ironically to group changes into larger batches. So the definition of what a "change" is has grown too.
karmasimida18 hours ago
This is definitely the Claude killer OpenAI is cooking. And so far it has succeeded
jstummbillig1 day ago
Inline poll: What reasoning levels do you work with? This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b
- newtwilly1 day ago
 For directed coding (implementing an already specified plan) or asking questions about a codebase I use 5.3 codex with medium reasoning effort. It is relatively quick feeling. I like Sonnet 4.6 a lot too at medium reasoning effort, but at least in Cursor it is sometimes quite slow because it will start "thinking" for a long time.
XCSme22 hours ago
Looking ok, but nothing special: <a href="https://aibenchy.com/model/openai-gpt-5-4-medium/" rel="nofollow">https://aibenchy.com/model/openai-gpt-5-4-medium/</a>
- QRe21 hours ago
 Does this LLM benchmark have any actual credibility? I get why they chose to not publish the actual tests but I find it highly dubious that there are only 15 tests and Gemini 3 Flash performs best.
 - XCSme21 hours ago
 I actually made it, so I'm not sure if it has credibility, but the tests are simply various (quite simple) questions, and models are just tested on them. I am also surprised Gemini 3 Flash does so well (note that only the MEDIUM reasoning does exceptionally well). When I look at the results, it does make sense though. Higher models (like Gemini 3 pro) tend to overthink, doubt themselves and go with the wrong solution. Claude usually fails in subtle ways, sometimes due to formatting or not respecting certain instructions. From the Chinese models, Qwen 3.5 Plus (Qwen3.5-397B-A17B) does extremely well, and I actually started using it on an AI system for one of my clients, and today they sent me an email saying they were impressed with one response the AI gave to a customer, so it does translate to real-world usage. I am not testing any specific thing, the categories there are just a hint as to what the tests are about. I just added this page to maybe provide a bit more transparency, without divulging the tests: <a href="https://aibenchy.com/methodology/" rel="nofollow">https://aibenchy.com/methodology/</a>
_pdp_7 hours ago
Tried it today - pretty much underwhelming.
- rambojohnson7 hours ago
  it's shallow release theater at this point, trying to fake-spike engagement.
butILoveLife1 day ago
Anyone else completely not interested? Since GPT5, it's been cost cutting measure after cost cutting measure. I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they don't ask math or coding questions.
- machiaweliczny14 hours ago
 5.2 and 5.3 are strong/best for coding, 5.0 and 5.1 were garbage
smusamashah1 day ago
I only want to see how it performs on the Bullshit-benchmark <a href="https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html" rel="nofollow">https://petergpt.github.io/bullshit-benchmark/viewer/index.v...</a> GPT is not even close to Claude in terms of responding to BS.
- mistercow23 hours ago
 My current hunch is that that benchmark captures most of the relevant gap between Anthropic and the rest. “Can’t distinguish truth from fiction” has long been one of the deeper complaints about LLMs, and the bullshit benchmark seems like a clever approach to testing at least some of that.
motoboi19 hours ago
I’m planning a change that will save 20k a month of storage. I absolutely could come up with the details and implementation by myself, but that would certainly take a lot of back and forth, probably a month or two. I’m an api user of Claude code, burning through 2k a month. I just this evening planned the whole thing with its help and actually had to stop it from implementing it already. Will do that tomorrow. Probably in one hour or two, with better code than I could ever write alone myself. Having that level of intelligence at that price is just bollocks. I’m running out of problems to solve. It’s been six months.
strongpigeon1 day ago
It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.
- _heitoo23 hours ago
  It makes sense in scenarios where a model needs >200k tokens to answer a single prompt. You're shackled to a single session, and if the model hits compaction limits, it'll get lobotomized and give a shitty answer, so higher limits, even with degraded performance, are still an improvement.
- Tiberium1 day ago
  They don't actually seem to charge more for the >200k tokens on the API. OpenRouter and OpenAI's own API docs do not have anything about increased pricing for >200k context for GPT-5.4. I think the 2x limit usage for higher context is specific to using the model over a subscription in Codex.
- simianwords1 day ago
  [flagged]
  - strongpigeon1 day ago
    I guess that you pay more for worse quality to unlock use cases that could maybe be solved by better context management.
rurban17 hours ago
The question is still: Does it make your code better or worse? Only Opus makes it better, the rest worse. That's the threshold
- prodigycorp16 hours ago
 I've been using it for three hours and it's insanely good. It's almost perfectly (needed a single touchup prompt) completed a full css refactoring that I've wanted to do for months that I've tried to have other models do but nothing worked without heavy babysitting. Also, in the course of coding, it's actually cleaning up slop and consolidating without being naturally prompted.
iamronaldo1 day ago
Notably 75% on os world surpassing humans at 72%... (How well models use operating systems)
OsrsNeedsf2P1 day ago
Does anyone know what website is the "Isometric Park Builder" shown off here?
- turblety1 day ago
 They built that using GPT-5.4: "Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt". GPT literally built that game.
cj1 day ago
I use ChatGPT primarily for health related prompts. Looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc. Interesting, the "Health" category seems to report worse performance compared to 5.2.
- paxys1 day ago
 Models are being neutered for questions related to law, health etc. for liability reasons.
 - cj1 day ago
 I'm sometimes surprised how much detail ChatGPT will go into without giving any disclaimers. I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations. I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though. Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.
 - bargainbin1 day ago
 Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs. I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.
 - staticman21 day ago
 If you had created a project with custom instructions and/ or custom style I think you could have gotten Claude to respond the way you wanted just fine.
 - tiahura1 day ago
 Are you sure about that? Plenty of lawyers that use them everyday aren't noticing.
- partiallypro1 day ago
 I've done the same, and I tested the same prompts with Claude and Google, and they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall on this. Claude and Google are dangerously unusable on the subject of health, from my experience.
 - zeeebeee1 day ago
 what's best in your experience? i've always felt like opus did well
Aldipower1 day ago
So did they raise the ridiculously small "per tool call token limit" when working with MCP servers? This makes Chat useless... I do not care, but my users do.
creatonez21 hours ago
> We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents. Nothing infuriates me more than an LLM tool randomly deciding to create docx or xlsx files for no apparent reason. They have to use a random library to create these files, and they constantly screw up API calls and get completely distracted by the sheer size of the scripts they have to write to output a simple document. These files have terrible accessibility (all paper-like formats do) and end up with way too much formatting. Markdown was chosen as the lingua franca of LLMs for a reason, trying to force it into a totally unsuitable format isn't going to work.
swingboy1 day ago
Even with the 1m context window, it looks like these models drop off significantly at about 256k. Hopefully improving that is a high priority for 2026.
MickeyShmueli12 hours ago
the 1M context is cool but tbh the token cost problem nobody's talking about is tool schema bloat. before the model writes a single line of code it's already consumed thousands of tokens just ingesting function definitions. i've seen agent setups where 30-40% of the context window is tool descriptions before any actual work happens. the per-token price war is nice but if your schema is 10k tokens of boilerplate you're still burning money
- stingraycharles12 hours ago
  what do you mean nobody is talking about tool schema bloat. everybody is talking about it, and why it’s the general recommendation to just use CLI whenever possible.
- CalisBalis32112 hours ago
  1. everyone talks about this 2. have you seen GPT5.4 new ToolSearch functionality? that's supposed to handle exactly that.
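To put a number on the schema-bloat point in this sub-thread, here is a rough sketch in Python. The 4-characters-per-token ratio is a crude heuristic assumed for illustration, not a real tokenizer, and the tool definitions are whatever JSON-serializable dicts your agent sends; none of this reflects a specific vendor's API.

```python
import json
from typing import Any, Dict, List

CHARS_PER_TOKEN = 4  # assumption: crude average, replace with a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def schema_overhead(tools: List[Dict[str, Any]], context_window: int) -> float:
    """Fraction of the context window consumed by tool definitions alone."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return estimate_tokens(serialized) / context_window
```

Running `schema_overhead` on the full tool list before and after pruning gives a quick sense of how much context is spent before any work happens.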
nthypes1 day ago
$30/M Input and $180/M Output Tokens is nuts. Ridiculously expensive for not that great a bump in intelligence when compared to other models.
- stri8ted1 day ago
 Price: Input: $2.50 / 1M tokens. Cached input: $0.25 / 1M tokens. Output: $15.00 / 1M tokens. <a href="https://openai.com/api/pricing/" rel="nofollow">https://openai.com/api/pricing/</a>
- nthypes1 day ago
 Gemini 3.1 Pro: $2/M Input Tokens, $15/M Output Tokens. Claude Opus 4.6: $5/M Input Tokens, $25/M Output Tokens.
 - nthypes1 day ago
 Just to clarify, the pricing above is for GPT-5.4 Pro. For standard, here is the pricing: $2.5/M Input Tokens, $15/M Output Tokens.
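For a sense of scale, the per-request cost at the prices quoted in this sub-thread can be worked out as below. The prices are the ones posted by commenters here and may not match current price lists; the request size (100k input tokens, 10k output tokens) is an arbitrary assumption chosen for illustration.

```python
from typing import Dict, Tuple

# (USD per 1M input tokens, USD per 1M output tokens), as quoted in this thread.
PRICES: Dict[str, Tuple[float, float]] = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Pro": (30.00, 180.00),
    "Gemini 3.1 Pro": (2.00, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD, ignoring cached-input discounts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 100k tokens in, 10k tokens out.
    for model in PRICES:
        print(f"{model:16s} ${request_cost_usd(model, 100_000, 10_000):.2f}")
```

At that request size the standard model comes to about $0.40 and Pro to about $4.80, which is the 12x gap implied by the quoted prices.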
- energy1231 day ago
 For Pro
- joe_mamba1 day ago
 Better tokens per dollar could be useless for comparison if the model can't solve your problem.
- rvz1 day ago
 You didn't realize they can increase / change prices for intelligence? This should not be shocking.
 - nickthegreek1 day ago
 OP made no mention of not understanding cost relation to intelligence. In fact, they specifically call out the lack of value.
- moralestapia1 day ago
 Don't use it?
bob10291 day ago
I was just testing this with my unity automation tool and the performance uplift from 5.2 seems to be substantial.
vicchenai22 hours ago
Been switching between models every few weeks at this point. The computer use stuff is what I'm most curious about - tried Anthropic's version a while back and it was pretty hit or miss. Curious if OpenAI's take is more reliable for actual day to day work.
padamkafle19 hours ago
Guys, while we celebrate openai gpt 5.4, please do look into this as well: <a href="https://news.ycombinator.com/item?id=47259846">https://news.ycombinator.com/item?id=47259846</a>
ApexGrab17 hours ago
It's the competitor of Opus 4.5, and gpt 5.4 uses tokens wisely, not like Opus, whose tokens vanish in minutes
emsign12 hours ago
Murderers
Cort3z12 hours ago
So, are we way into diminishing returns for these models at this point? If so, I think we can calculate when it will be available at home. Given this requires a GB200 NVL72, which has about 1,440 PFLOPS, and the current 5090 chip has about 1,676 TFLOPS, that is roughly a 1000x scale-up to the GB200. If we assume Moore's law still holds, which it might not, we are looking at log2(1000) ≈ 9.97, or about 10 doublings. At one doubling per year that is about 10 years; at the more commonly cited two years per doubling it is closer to 20.
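The back-of-the-envelope estimate above can be checked in a few lines of Python. The two throughput figures are taken as quoted in the comment (not independently verified), and the doubling periods are assumptions: Moore's law is usually stated as a doubling roughly every two years, so both one and two years are shown.

```python
import math

GB200_NVL72_PFLOPS = 1_440  # figure quoted in the comment
RTX_5090_TFLOPS = 1_676     # figure quoted in the comment


def doublings_needed(target_tflops: float, current_tflops: float) -> float:
    """Number of 2x steps needed to get from current to target throughput."""
    return math.log2(target_tflops / current_tflops)


ratio = GB200_NVL72_PFLOPS * 1_000 / RTX_5090_TFLOPS  # PFLOPS -> TFLOPS
doublings = doublings_needed(GB200_NVL72_PFLOPS * 1_000, RTX_5090_TFLOPS)

if __name__ == "__main__":
    print(f"scale-up: {ratio:.0f}x, doublings: {doublings:.2f}")
    for years_per_doubling in (1.0, 2.0):  # assumed doubling periods
        print(f"{years_per_doubling:.0f} yr/doubling -> "
              f"{doublings * years_per_doubling:.1f} years")
```

With the quoted figures the ratio comes out nearer 860x than 1000x, so slightly under 10 doublings; the answer in years depends almost entirely on which doubling period is assumed.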
koakuma-chan1 day ago
Anyone else getting artifacts when using this model in Cursor? numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":
- mike_hearn1 day ago
 I've seen that problem with 5.3-codex too, it didn't happen with earlier models. Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
- ValentineC23 hours ago
 I just got some interesting artifacts in Codex when I tried to oneshot a conference page design (my version of the pelican riding a bicycle). GPT-5.4 added some weird guidance that I wouldn't normally expect to see as a normal page visitor.
ilaksh1 day ago
Remember when everyone was predicting that GPT-5 would take over the planet?
- dbbk1 day ago
  It was truly scary, according to Sam...
- zeeebeee1 day ago
  iTs lITeRaLlY AGI bro
vicchenai1 day ago
Honestly at this point I just want to know if it follows complex instructions better than 5.1. The benchmark numbers stopped meaning much to me a while ago - real usage always feels different.
nembal21 hours ago
so it seems each RL step extends into a market! 5.3 was targeted at coding, 5.4 is targeted at finance, 5.5 is healthcare?
motza1 day ago
No doubt this was released early to ease the bad press
big-chungus416 hours ago
1.3 more versions to AGI
faizan19913 hours ago
is this model of chatgpt good for coding?
dakolli20 hours ago
Sorry I don't use technology from companies that are eager to participate in the mass murder of civilians.
tomlockwood20 hours ago
Is this the best one for blowing up arab children and identifying their bodies in the rubble?
esafak21 hours ago
An important feature is the introduction of tool search, which provides models with a "lightweight list of available tools along with a tool search capability", thereby Making MCP Great Again!
rambojohnson7 hours ago
Great. A new version of the same model, or a different one that performs worse or exactly the same. This whole release theater, just to give shareholders the impression of growth, is such a bullshit grift. And considering the stance on openai of a majority of the users here compared to the number of upvotes, are HN likes bot-farmed?
world2vec1 day ago
Benchmarks barely improved it seems
freedomben23 hours ago
> When toggled on, /fast mode in Codex delivers up to 1.5x faster token velocity with GPT‑5.4. It’s the same model and the same intelligence, just faster. I hate these blog posts sometimes. Surely there's got to be some tradeoff. Or have we finally arrived at the world's first "free lunch"? Otherwise why not make /fast always active with no mention and no way to turn it off?
- cheevly19 hours ago
 Try improving your attention to detail / reading skills.
ltbarcly31 day ago
Not a single comparison between 5.4 and Gemini or Claude. OpenAI continues to fall further behind.
atkrad1 day ago
What is the main difference between this version with the previous one?
melbourne_mat1 day ago
Quick: let's release something new that gives the appearance that we're still relevant
tmpz221 day ago
Does this improve Tomahawk Missile accuracy?
- ch4s31 day ago
  They're already accurate within 5-10m at Mach 0.74 after traveling 2k+ km. It's 5m long so it seems pretty accurate. How much more could you expect?
  - mikkupikku1 day ago
    You could definitely do better than that with image recognition for terminal guidance. But I would assume those published accuracy numbers are very conservative anyway..
  - keithnz1 day ago
    I think for LLM like Open AI, it wouldn't be about hitting the target but target selection. Target selection is probably the most likely thing that won't be accurate
Gareth32114 hours ago
Holy shit, I just used Atlas browser to navigate on screen and it automatically clicked the "reject cookies" button without me asking!
ulfw20 hours ago
So desperate how they're bumping out these 'updates'
Thanakorn_55110 hours ago
wow
brcmthrowaway1 day ago
How much of LLM improvement comes from regular ChatGPT usage these days?
gigatexal1 day ago
Is it any good at coding?
throwaway57521 day ago
Does this model autonomously kill people without human approval or perform domestic surveillance of US citizens?
OutOfHere1 day ago
What is with the absurdity of skipping "5.3 Thinking"?
lostmsu1 day ago
What is Pro exactly and is it available in Codex CLI?
- akmarinov1 day ago
 It’s not. It’s their ultra thinking model that’s really good but takes 40 minutes to come up with an answer
 - fy201 day ago
 It's available on OpenRouter. $180/1M output....<a href="https://openrouter.ai/openai/gpt-5.4-pro" rel="nofollow">https://openrouter.ai/openai/gpt-5.4-pro</a>
HardCodedBias1 day ago
We'll have to wait a day or two, maybe a week or two, to determine if this is more capable in coding than 5.3, which seems to be the economically valuable capability at this time. In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
wahnfrieden1 day ago
No Codex model yet
- minimaxir1 day ago
 GPT-5.4 is the new Codex model.
 - nico12071 day ago
 GPT-5.3-Codex is superior to GPT-5.4 in Terminal Bench with Codex, so not really
 - conradkay1 day ago
 General consensus seems to be that it's still a better coding model, overall
 - koakuma-chan1 day ago
 It just released, how is there a general consensus already
 - wahnfrieden23 hours ago
 some non-employees have been using it for a while already
 - wahnfrieden1 day ago
 Finally
oytis1 day ago
Everyone is mindblown in 3...2...1
ignorantguy1 day ago
it shows a 404 as of now.
- minimaxir1 day ago
 Up now. The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.
 - Leynos1 day ago
 Guess the URL and post at 10 AM PST on the day of release.
 - bdangubic1 day ago
 curl the URL <a href="https://openai.com/index/introducing-gpt-5-" rel="nofollow">https://openai.com/index/introducing-gpt-5-</a>? until you get 200
 - mudkipdev1 day ago
 Probably refresh the api models list every couple minutes instead. No one could have guessed the name of GPT-Codex-Spark
petetnt22 hours ago
Whoa, I think GPT-5.3 Instant was a disappointment, but GPT-5.4 is definitely the future!
lacoolj5 hours ago
lol yet another pat on their own backs without comparison to other frontier models. Also, the timing of this release, 5.3 and 5.2, relative to the other releases, feels more like a bug fix than something "new"
iamleppert1 day ago
I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also not including the parameters the models were run at (especially the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks and logs. Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
fernst1 day ago
Now with more and improved domestic espionage capabilities
thefounder1 day ago
Is it just me or the price for 5.4 pro is just insane?
simianwords1 day ago
What is the point of gpt codex?
- catketch1 day ago
 -codex variant models in earlier versions were just fine-tuned for coding work, and had a little better performance for related tool calling and maybe instruction calling. In 5.4 it looks like they just collapsed that capability into the single frontier family model
 - akmarinov1 day ago
 They’ll likely come out with a 5.4-Codex at some point, that’s what they did with 5 and 5.2
 - simianwords1 day ago
 Yes so I’m even more confused. Why would I use codex?
 - joshuacc1 day ago
 Presumably you don’t anymore if you have 5.4.
 - energy1231 day ago
 You choose gpt-5.4 in the /model picker inside the codex app/cli if you want.
peq4220 hours ago
more useless slop machines
minimaxir1 day ago
More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005
- dang1 day ago
 Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.
leftbehinds1 day ago
some sloppy improvements
elmean1 day ago
[flagged]
- spiralcoaster1 day ago
 This is the low quality reddit-style garbage that gets upvoted on HN these days?
 - zarzavat1 day ago
 What are we supposed to talk about in this thread exactly? The developers of this model are evil. Are we supposed to just write dry comments about benchmarks while OpenAI condones their models being deployed for autonomously killing people? Yes I'm sure it makes a very nice bicycle SVG. I will be sure to ask the OpenAI killbots for a copy when they arrive at my house.
 - esalman1 day ago
 While low quality, it is extremely important, potentially historically significant too.
 - Someone12341 day ago
 If it is actually that important, then maybe more effort should be made so it isn't "low quality." Cannot be very important to them if they're disinterested in presenting an intellectually compelling argument about it. PS - If you think I am not sympathetic to what they're raising, you're very much mistaken. But they're not winning anyone new over to their side with this flamebait.
 - patcon5 hours ago
 Sometimes you throw a brick through a window, not because it's an intellectual thing to do, but because of the hundred people who'll maybe smash the next hundred windows after you do yours. And then, because any supportive response to all that window smashing is informative as collective intelligence... and then, bc that all validates that the order that all these clever rules were upholding is illegitimate. It's how a very stupid thing stands in for a million smart and well-understood things that everyone is also trying to say.
 - Sabinus1 day ago
 You can say your piece about how you don't like OpenAI working with the US military on lethal AI without making Reddit style quips.
 - Nicholas_C1 day ago
 The HN of old is no more unfortunately. Things get up or down voted based purely on political alignment.
 - karmasimida1 day ago
 As programmers become intelligently irrelevant in the whole picture, you would see more posts like this
 - elmean1 day ago
 "This account belongs to a lazy person" true
 - elmean1 day ago
 I was just reading the model card...
 - mycall1 day ago
 True and simply vote it down.
 - elmean1 day ago
 mycall would also be to do the same
 - rd1 day ago
 Noticeably yes much more than usual. It’s quite bad. I need to start blocking accounts.
 - fhcbcofkf1 day ago
 [flagged]
 - mycall1 day ago
 You are applying a problem which every AI company has, not unique to OpenAI. What about other nation-states making auto-AI robots which kill children, will you still choose to pick out OpenAI specifically? Maybe your concern is too late and dozens of countries already are training their own AIs to do that or worse.
 - bigfishrunning1 day ago
 This company sucks, what about all the other ones that suck hmmmmmm? All of these VC funded AI companies are bad. Full stop. Nothing good for humanity will come of this.
 - Computer01 day ago
 You underestimate my capacity for broad hatred
- timedude1 day ago
 Absolutely amazing. Grateful to be living in this timeframe
- bramhaag1 day ago
 What makes you think that they see bombing civilians as a bug, not a feature?
 - elmean1 day ago
 first real comment, I thought that at first but this could lower the possible users that could be using chatGPT and that would be against us (shareholders)
- throwaway9112821 day ago
 what a thoughtful comment! HN is so low quality these days
- oklahomasports1 day ago
 Evidence
- skilltissue1 day ago
 Don't use the site this way. https://news.ycombinator.com/newsguidelines.html
 - Chance-Device1 day ago
 You made a burner account just to scold this guy? Don’t use burner accounts this way.
 - louiereederson1 day ago
 I think for your comment to follow the guidelines, you need to explain why the original comment did not follow them. Customer values are relevant to the discussion given that they impact choice and therefore competition.
 - patcon1 day ago
 Not all rule-following is noble or wise.
 - elmean1 day ago
 AINT NO PARTY LIKE A GARRY TAN HOT TUB PARTY
 - himata41131 day ago
 news guidelines
 - adamtaylor_131 day ago
 Parlay?
- Chance-Device1 day ago
 Ironically this would actually be a good thing. As we can see from Iran, Claude doesn’t quite have these bugs ironed out yet…
 - MSFT_Edging1 day ago
 This is the exact attitude that led to a chat bot being used to identify a school for girls as a valid target. The chatbot cannot be held responsible. Whoever is using chatbots for selecting targets is incompetent and should likely face war crime charges.
 - bananamogul1 day ago
 "that led to a chat bot being used to identify a school for girls as a valid target" Has it been stated authoritatively somewhere that this was an AI-driven mistake? There are myriad ways that mistake could have been made that don't require AI. These kinds of mistakes were certainly made by all kinds of combatants in the pre-AI era.
 - Chance-Device1 day ago
 Do you think anyone is ever going to say this under any circumstances? That Anthropic were right and they were proved right the very next day? Yeah yeah, they probably had a human in the loop, that’s not really the point though.
 - Sabinus1 day ago
 Targeting and accuracy mistakes happen plenty in wars that aren't assisted by AI. I don't think it's fair to assume that AI had a hand in the bombing of the school without evidence.
 - Chance-Device1 day ago
 What attitude exactly are you talking about? The one that says that if you’re going to morally sell out it would be better if you at least tried not to kill children?
kotevcode1 day ago
[flagged]
- sd91 day ago
  Please stop spamming HN with LLM generated comments.
  - Havoc1 day ago
    I guess he picked the wrong model to route to…
woeirua1 day ago
Feels incremental. Looks like OpenAI is struggling.