I'm a convert. I was 100% skeptical about LLM code generation; now over 80% of the professional code I write is generated.<p>That said, the limitations are fairly obvious and are starting to show in some of my projects, and this article seems to confirm my suspicions. Whether that's just confirmation bias, I can't say yet.<p>In my experience, for anything complex enough, I have to keep adding more and more constraints, style guides, corner cases, error handling, optimization guidelines and all this good stuff to my Markdown specifications, rules and skills. At some point this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language. The writing speed gains are enormous, yeah, and business sees this as productivity gains, of course - and we do it because the pressure for increased productivity is there, as it's always been; yet the trade-off seems clear and a lot of people are just ignoring it.
> moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language<p>It's like using a compiler that generates semantically different code every time you run it. Basically like compiling a program that's full of UB but "seems to work" most of the time.<p>> business sees this as productivity gains<p>Back to LoC/s as a measure of "productivity."
> Back to LoC/s as a measure of "productivity."<p>IMO this doesn’t follow from what OP wrote.
I personally measure it with a more abstract “how long does it take me to ship something that is useful in production and solving a real problem” and the increase in speed there has been massive for me.
But of course I’m not a bigbrain 10x coder that is doing bleeding edge novel stuff like most people here, so gains might be more obvious for me than for others.
If 80% of the code you are generating is from an LLM, then you are merely remixing what's out there. Aka slop.<p>LLMs cannot generate anything novel.
And how much of pre-LLM code was just copy pasta from Stack Overflow?<p>Code doesn't need to be novel to be useful. There's a reason why design patterns are a thing in software.
> you are merely remixing what's out there<p>So basically 90% of programming in an enterprise environment? lol. Sounds useful to me...
LLMs recently solved a major, famous open mathematical problem in combinatorial geometry:<p><a href="https://www.reddit.com/r/math/comments/1tj534d/openais_internal_model_disproves_unit_distance/" rel="nofollow">https://www.reddit.com/r/math/comments/1tj534d/openais_inter...</a>
Unless you mean remixing the alphabet / tokens, this is mathematically false. 2^256 gets you to unique very fast, and that’s like 252 bytes or like 80 tokens. Remember almost all numbers are irrational. Complexity is infinite.
Most code written is not novel. Actually, most code written should not be novel. E.g., most lines of code aren't spent on writing git; they're spent on writing that landing page no one ever sees.
There is nothing new under the sun.
they can take novel things. so my novel
If we were doing novel things, we’d be scientists. I’m an engineer though. I don’t think I’ve been writing slop for 30 years.
“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”<p>One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
I think it's downstream of "you can't optimize for two different objectives".<p>If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.<p>If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.
> ... This is also why adding to the prompt examples of the style of code you want ...<p>You could take it a step further and put the example code into source code files...and be like, <i>super</i> comprehensive with your examples ... ;)
Would you mind sharing antirez' suggestion?
I am obviously paraphrasing, but the general idea is that trying to synthesize style from a codebase into e.g. a markdown guide generally doesn't work very well. What achieves style transfer is providing the model with a lot of examples of the style, conventions, patterns you want.<p>To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
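A minimal sketch of the "do it in the style of this file" approach described above, assuming you assemble the prompt yourself before handing it to an agent. The function name `build_exemplar_prompt`, the prompt wording, and the `max_chars` cap are all hypothetical and not part of any agent's API.

```python
from pathlib import Path


def build_exemplar_prompt(task: str, exemplar_paths: list[str], max_chars: int = 4000) -> str:
    """Assemble a prompt that shows the model concrete files to imitate.

    Hypothetical helper: the name, the prompt wording, and the max_chars
    cap are illustrative choices, not part of any agent's API.
    """
    sections = []
    for path in exemplar_paths:
        source = Path(path).read_text(encoding="utf-8")
        # Truncate very long files so the exemplars do not crowd out the task.
        sections.append(f"### Exemplar: {path}\n{source[:max_chars]}")
    exemplars = "\n\n".join(sections)
    return (
        f"{task}\n\n"
        "Match the style, naming, and structure of the exemplar files below. "
        "They were done well; treat them as the reference.\n\n"
        f"{exemplars}"
    )
```

The point of the design is that the model sees real code rather than a prose summary of conventions, which is the difference the comment above is drawing.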
Right, more simply put, it's great at being a copycat, exploring similar data points that match your token needs.<p>It is not great at decision making or judgment calls that don't have a well-defined spec or plan in place yet; like unofficial or unapproved tokens, if you will. A lot of this stuff simply has never had specs, as it has been internal to how companies work and their secret sauce.<p>The closest thing we have are governance and compliance policies, which legal/business needs require, so they are far better documented than the operational ones covering how we work. It is more about the how versus the what here, I guess is what I'm saying.<p>But yeah, this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. It is far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.
Doesn't that make sense? It's text prediction. If you give it examples, it can predict. Synthesizing "put semi-colons on new lines" requires it to generate its own examples 'in its head' (so to speak) and remember that. It won't.<p>It's like when I see people feeding it a whole bunch of "best practices" and expecting it to follow them. It won't. But you could ask it questions about the best practices all day long.
Yes, exactly. Any engineer deep in this stuff right now understands that it's a grounded predictive engine sprinkled with RL training, and is discovering what that means in terms of its strengths and weaknesses for company use.
Given an unspecified or poorly specified function f(x) and an example "f(A)=>B", answering "given C, tell me what f(C) is" lies at the core of creativity.<p>Idk, calling it "just text prediction" seems unfairly dismissive of this capability.
Saying that it’s dismissive is like saying writing (insert language) is dismissive that you’re just writing assembly.<p>at the end of the day, it presents a vector field and predicts the next vector. That’s literally the heart of intelligence just like assembly is the heart of execution. When playing table tennis, your brain is literally predicting seconds into the future to get your body into the right position.<p>But we aren’t discussing intelligence here. We are discussing how best to utilize that intelligence.
You're making my point for me, saying table tennis is "just a proprioceptive predictor" is dismissively reductive (and not a particularly useful framework for understanding table tennis), even if it is strictly speaking accurate. It's the sort of thing someone who has no idea how hard training for table tennis is would say.
I ran into similar issues as we started to roll out LLM generated financials in our org. I’m so used to the old SQL workflow of “grab this data from this table, that data from that table, combine it into a final result that looks like xxxx” where the tables were outputs from reports in our ERP, but I was having terrible results.<p>Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot it from there.<p>Kind of crazy, and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month, but so far so good.
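For what a redundant check-sum on a generated report might look like, here is a rough sketch. It assumes you can compute per-project totals a second, independent way (for example the old SQL workflow) and compare them with the LLM-built report. The function name `reconcile`, the dictionary shape, and the one-cent tolerance are assumptions, and the actual cash formula is deliberately left out.

```python
from decimal import Decimal


def reconcile(
    generated: dict[str, Decimal],
    independent: dict[str, Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> list[str]:
    """Return human-readable mismatches between two sets of per-project totals.

    Hypothetical helper: `generated` holds totals from the LLM-built report,
    `independent` holds totals computed some other way.
    """
    problems = []
    for project in sorted(set(generated) | set(independent)):
        if project not in generated:
            problems.append(f"{project}: missing from generated report")
        elif project not in independent:
            problems.append(f"{project}: missing from independent calculation")
        elif abs(generated[project] - independent[project]) > tolerance:
            problems.append(
                f"{project}: generated {generated[project]} "
                f"!= independent {independent[project]}"
            )
    return problems
```

An empty list means the two calculations agree within tolerance; anything else is a reason to look at the report before it goes out.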
I was recently using Copilot to implement a small feature within a very large codebase. About 75-80% of the time, the code that was added matched the current style (warts and all). Copilot would specifically go off and research "How X is already done in the codebase" all the time.
You basically get this for free, if the coding agent has read the relevant classes that the legacy code it's touching has to match.<p>Just don't break out a plan without also having it read the code again.
I've noticed something similar with AI assist authored books as well. Early on it does alright, but after some chapters the beginning of each chapter repeats the end of the previous, and obvious LLM tells become more frequent.<p>The more it has to go on, the more it relies on repetition of what came before. It's also possible that authors start paying much less attention and put less effort into editing later chapters.<p>Despite the sheer volume on Amazon, LLMs are not at the point of writing well.
Hmm, I have some anecdotal evidence this is true. While interactively working out a plan with Opus, on multiple occasions it came up with an incompatible solution; when I add additional context/requirements, it has a tendency to "anchor" on its original architecture and struggles to adapt. Sometimes it tries to sneak in changes for the original plan anyway.
Opus does this waaaay too much for my taste. It works fine for vibe-coders but for technical work it is infuriating.
I think the problem is they take the shortest path to the goal ...which may or may not coincide with what you have planned. Oh, and they generally think instructions are merely suggestions and what you really want is this totally different thing and not the one in the plan you handed them; plus, as a stroke of good luck, this other system is a lot easier to implement as well.<p>I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when trying to implement the next thing on top of it.<p>Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are <i>very</i> frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly-baked -- maybe not such an HR friendly plan?
That may be the same problem seen when prompts try to force "alignment" or "guardrails". There's a performance drop. Seemingly, a big chunk of the potential solution space has been made unreachable.<p>For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.<p>That was last year. Is it happening with the frontier models?
Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.<p>I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.<p>Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.
Wait isn't gpt 5.2 good? Or is it not thinking / not codex? 5.2 was what sparked the late 2025 openai agentic programming revolution.
5.2 still had a Codex variant, which this doesn't describe using. It also notably is not using the Codex harness -- it does everything with open source harnesses (which obviously are worse). And while it uses two harnesses with its cheap models, it only uses the worse-performing one of those with GPT 5.2 for cost reasons. (They also don't specify effort/thinking level used for GPT 5.2, but given that it performs worse in their baseline testing than obviously non-SOTA models, I'm guessing it wasn't set to anything high.)
Oldheads remember when GPT 5.2 was at the forefront of agentic programming. December 2025 feels like eons ago, but alack it was an entire half year!
> their performance drops when forced to navigate explicit architectural rules<p>Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.
Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline most LLMs can perform long horizon tasks on without accumulating errors & corrupting the document.<p>I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.<p>[1] <a href="https://arxiv.org/abs/2604.15597" rel="nofollow">https://arxiv.org/abs/2604.15597</a><p>Discussion: <a href="https://news.ycombinator.com/item?id=48073246">https://news.ycombinator.com/item?id=48073246</a>
Not those carefully designed constraints that I set up from the beginning, but short-term ones that I came up with after an agent failed in some way: "Validate JWT at the route level, not the component." "Call workspace provisioning on each user creation." Both because of things the agent had done incorrectly.<p>Aspiration vs. consequence, in other words. An aspiration constraint describes a desired outcome for the system; a consequence constraint maps to a problem already encountered. And the agent ignores the former when faced with the path of least resistance while obeying the latter because it is brief, unambiguous, and precise about preventing that particular failure mode. Which is key rather than the harness in determining survival through session rotation.
I've been experimenting quite a bit with long-horizon agentic coding[1] and I have also noticed that agents seem to perform worse when forced into certain architectural patterns. I have found that it is a bit better when including the constraints along the way instead of adding them after the fact. There seems to be a side-effect I have been calling "calcification", where a pattern starts appearing in the codebase and the agent follows the pattern to the point where it dominates the context and becomes self-reinforcing. This could potentially be a strength or a weakness for existing code bases depending on the codebase quality. I will have more insights on this soon as more from-scratch runs conclude that include architectural guidance from the beginning.<p>[1]: <a href="https://medium.com/@vishvananda/i-spent-2-billion-tokens-writing-a-c-compiler-so-you-dont-have-to-d3e4eec4781e" rel="nofollow">https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...</a>
> agents seem to perform worse when forced into certain architectural patterns.<p>FWIW I've noticed this too. I've found that the agents/models have their own style, which is mostly summed up as overly verbose.<p>Additionally, the models are OK at modularization when given space to "plan" their implementation, but rarely decide that abstracting something would be helpful after the fact (i.e. after many iterations on a greenfield codebase or when being dropped into a legacy codebase).<p>This often leads to "god files" which, when pointed out by the user/architect, the models correctly critique (humorously, since they're the ones that wrote the code in the first place).
Also they used languages with dynamic typing like Python & JS. In my experience a statically typed codebase is easier for humans to maintain, so maybe it is for agents too.<p>When using Codex/Claude Code with Go code, I cannot count the times the agent makes some change, runs a build to check for errors, finds some and fixes them.
I am trying to avoid this by building a plugin based on my memory management project Aristotle. I add a status machine to monitor the activities of LLM while it does jobs following my tdd-pipeline skills, which begins with requirements clarification and ends up with delivery.<p>These two projects are on GitHub, you may search alexwwang/aristotle and alexwwang/tdd-pipeline to dive into the details or just ask your LLM to scan them to tell you the points you are interested in.
This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window because at the end, the output would not respect the constraints (or guardrails), but to reliably produce production grade code you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories - and make this <insert ticket> change" - but keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).
This is not a new problem though. This is why we started writing modular code, strict interfaces etc
So give harder guardrails? SonarQube and the like. But I guess then the failure mode would be appeasing the linter while slowly forgetting the requirements... (or not so slowly, because the try/fail loop won't be nice to context at all..)
I recommend spending some time getting a few parts of the codebase idiomatic and then @-ing those files as exemplars. This works a lot better than trying to steer it with markdown. This works reasonably well for like FastAPI but JavaScript seems to be the worst, even with guidance and exemplars it'll prefer in-lining a bunch of garbage rather than use the APIs as directed.
<p><pre><code> tasks spanning eight web frameworks
</code></pre>
Does anyone else have the experience that LLMs create better pure HTML+CSS+JS than they do when working with existing frameworks?
So my finding is:
planning is worth it.<p>For slightly complex changes, I always run codex (5.5-high) in planning mode first.
I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so they can quickly discover relevant docs at planning time, only if they are needed. No need to know react specific stuff when it's dealing with a backend problem for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.<p>Using /goal however, it's really just constantly compacting and doing its thing, of course it gets sloppy. If only there was a state machine that would transform tickets into a Planning Mode Prompt, then use, idk. guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan) and then letting a less capable or less reasoning agent execute the plan, I think that would work the best.
I've been building <a href="https://engine.build" rel="nofollow">https://engine.build</a> to introduce a proper structured external agent orchestrator that's used to build with clear constraints and make sure the end result is what you wrote in your spec or requirements. Without having to babysit and micromanage the models.<p>Implementation phases very often go through 5-10 review and fix rounds to actually get the implementation to match the spec. It takes longer but that's what's necessary to get actually good results on long horizon tasks with detailed requirements. I'll be open sourcing it fully soon.
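The review-and-fix rounds mentioned above can be sketched as a bounded loop. This is only an illustration of the general idea, not how engine.build works: `review_fix_loop`, the `review` and `fix` callables, and the default of 10 rounds are all hypothetical stand-ins for whatever the reviewer and implementer agents are.

```python
from typing import Callable


def review_fix_loop(
    draft: str,
    review: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    max_rounds: int = 10,
) -> tuple[str, int, list[str]]:
    """Alternate review and fix until the reviewer has no findings or rounds run out.

    Returns (final draft, fix rounds used, remaining findings).
    """
    findings = review(draft)
    rounds = 0
    while findings and rounds < max_rounds:
        draft = fix(draft, findings)
        rounds += 1
        findings = review(draft)
    return draft, rounds, findings
```

Returning the remaining findings matters: if the loop hits `max_rounds` with findings left over, the caller knows the implementation still does not match the spec instead of assuming it converged.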
Odd they used GPT-5.2 and not GPT-5.2-codex. i.e. the one optimized for coding agent tasks.
Considering this is from academia, there's a chance there were limitations on the available models. My research group accesses OpenAI models via Azure, and until recently (last week) the latest model was GPT 5. We just got 5.4.
This is why we as an industry have spent so much effort optimising the code generation process with things like skills, rules, tests, reviews, lints, agentic loops with feedback and sub-agents, and the code-runners. It is not just LLMs building code, it is an eco-system collaborating together.<p>I would agree too that as the codebase grows the LLM struggles more and more with generating code. It is probably misaligned incentives: it wants to complete the isolated task without consuming too much context. At the POC stage it can consume most of the app; by about 30K lines of code it is quite a complex codebase to navigate.
I think someone is going to figure out a framework for using LLMs for coding.<p>A framework would use static code checking tools to force an architecture on to LLMs instead of trying to do so in markdown.<p>I don't know exactly what it will look like but for example I could imagine a Java Framework where the LLM could only create subclasses of certain classes.
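The "only create subclasses of certain classes" idea can be sketched in Python rather than Java using the standard `ast` module. This is a toy check under stated assumptions: `find_disallowed_classes` and the notion of an `allowed_bases` set are hypothetical, and only direct base classes written in the source are inspected.

```python
import ast


def find_disallowed_classes(source: str, allowed_bases: set[str]) -> list[str]:
    """List classes in `source` that do not directly subclass an allowed base.

    Only the base names as written in the class statement are checked;
    indirect inheritance is not resolved.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        base_names = {ast.unparse(base) for base in node.bases}
        if not base_names & allowed_bases:
            violations.append(
                f"line {node.lineno}: class {node.name} must subclass "
                f"one of {sorted(allowed_bases)}"
            )
    return violations
```

Run in CI or as a pre-commit step, a check like this turns an architectural rule into a hard failure with a line number, instead of a sentence in a Markdown file the model may drift away from.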
These things don’t think. We’re going to have to reiterate this for a long time, I fear.
…but they reason well enough given enough context (using their matmuls).
There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.
There is a movie, Gold (2016), about a fake gold mine. One of its founders is a true believer: he found a few chunks of gold and started digging for more. The other founder is a nihilist: he realised that there is no gold there, but who cares if he makes the investors believe? So he does, and almost sells the company for $300M.<p>In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.<p>Investors will keep digging until the end of the age, or until they run out of money.
> For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.<p>Time to start writing linting tools that check the architecture and spoon feed the LLM what exactly it's doing wrong.<p>I reckon something like this would be good for every project out there: <a href="https://www.archunit.org/getting-started" rel="nofollow">https://www.archunit.org/getting-started</a><p>They expand a bit more on the reasoning behind it: <a href="https://www.archunit.org/motivation" rel="nofollow">https://www.archunit.org/motivation</a><p>(I also wrote a simple linter for architecture/code checks that aren't well encapsulated by ones that just focus on individual files, that uses Go + goja to write rules in ECMAScript and parallelize the read only ones and also allow ones that change files as necessary, in addition to something like Ruff / Oxlint / Oxfmt / whatever is present in each stack; though it is still in development and not as good of a focused example as ArchUnit is)<p>If we write software specification docs, bother describing how it evolves with ADRs, enforce code style automatically and require certain test coverage automatically (or at least should), why couldn't we go a step further, formalize those specs and ensure that any new code is also up to snuff? I don't think that's any more of a job for an LLM than telling it how it should format code is.
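To make the architecture-linting idea concrete, here is a small sketch in Python of a layering rule whose output is phrased so it can be pasted straight back to an agent. It is not ArchUnit and not the Go + goja linter mentioned above; `check_imports`, the `rules` mapping, and the `app.*` package names are hypothetical.

```python
import ast


def check_imports(module_name: str, source: str, rules: dict[str, set[str]]) -> list[str]:
    """Return one message per import that breaks a layering rule.

    `rules` maps a package prefix to the packages it must not import.
    Relative imports are skipped in this sketch.
    """
    banned: set[str] = set()
    for layer, targets in rules.items():
        if module_name == layer or module_name.startswith(layer + "."):
            banned |= targets
    messages = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported = [node.module]
        else:
            continue
        for name in imported:
            for target in sorted(banned):
                if name == target or name.startswith(target + "."):
                    messages.append(
                        f"{module_name}:{node.lineno}: imports {name}, but "
                        f"{module_name} must not depend on {target}"
                    )
    return messages
```

For example, with `rules = {"app.domain": {"app.web", "app.db"}}`, a domain module containing `from app.db.models import User` yields a message naming the module, the line, and the rule that was broken, which is the "spoon feed the LLM what exactly it's doing wrong" part.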
Also, I'm in the camp that believes that at least many of your ORM mappings and similar stuff should be the output of codegen, since you've already gone through the trouble of describing the schema/migrations to get there.<p>I don't think this would be <i>only</i> good for LLMs, though - I've seen projects that have like 3 different audit systems built in, not because of some fancy business requirement, but rather cause the devs either didn't know about the previous one(s) or just didn't feel like following what should have been the pre-established conventions, even when there were docs in place (nobody read those).
Fragility compounds fast when you add visual grounding to the loop. Code agents at least get structured feedback from the compiler.
> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.<p>I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.<p>The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
I find the same. We have abstractions with multiple concrete implementations, examples of patterns and examples of anti patterns.<p>I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.
The harness mattering more than the model lines up with my experience too. What this paper measures is within-turn constraint decay. The version that bites in multi-agent setups is across-session — the architectural rules an agent wrote down on Monday don't reach the agent making the next change on Tuesday.
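One low-tech way to carry rules across sessions is to persist them outside the conversation and prepend them to each new session. A minimal sketch, assuming a JSON file as the store; `record_constraint`, `session_preamble`, and the file layout are hypothetical, and the example rule is borrowed from an earlier comment in this thread.

```python
import json
from pathlib import Path


def record_constraint(store: Path, rule: str, reason: str) -> None:
    """Append a rule and the failure that motivated it, skipping exact duplicates."""
    entries = json.loads(store.read_text(encoding="utf-8")) if store.exists() else []
    if all(entry["rule"] != rule for entry in entries):
        entries.append({"rule": rule, "reason": reason})
        store.write_text(json.dumps(entries, indent=2), encoding="utf-8")


def session_preamble(store: Path) -> str:
    """Render stored rules as text to prepend to a new session's first prompt."""
    if not store.exists():
        return ""
    entries = json.loads(store.read_text(encoding="utf-8"))
    return "\n".join(f"- {e['rule']} (because: {e['reason']})" for e in entries)
```

Usage would be something like `record_constraint(Path("constraints.json"), "Validate JWT at the route level, not the component.", "agent validated it in the component")` on Monday, then `session_preamble(Path("constraints.json"))` at the top of Tuesday's prompt.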
As a codebase grows, divergent structural emergence from incidental (lang and lib) details results in prolonged complexity costs. I'm working on a language that enforces structure for agents: <a href="https://github.com/hale-lang/hale" rel="nofollow">https://github.com/hale-lang/hale</a>
This is interesting, anecdotally I have felt like I was having better luck with raw sqlite than using an ORM in a recent typescript project, using raw sqlite queries vs drizzle
Exactly why you can't remove humans in the loop to assess that the solution is not only correct (which LLMs are quite bad at, once concurrency, logic, etc are involved), but also elegant, maintainable, etc
"constraint decay" isn't this just another name for the (already well understood) idea of "context rot"?
This research is useless and nearly all other LLM research is too.<p>gpt 5.2 is the strongest model they tested, a nearly 6 month old model.<p>Traditional research can not keep up.
Agreed. As Simon Willison points out, November 2025 was a critical month because that's pretty much when coding agents became «good enough», eliminating most of the problems pointed out in this study.
I disagree, their findings should generalize to the frontier. Even if the latest can deal with the extra complexity, it stands to reason it will take more tokens to do less. This could be a useful insight into the next generation of evals.