What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few generated lines as possible, while the secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.<p>And if I am working on an existing codebase, then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that I wonder if what they have done could have been accomplished with a far smaller LoC budget.
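To make the "negative sum between added and removed lines" idea concrete, here is a minimal sketch that totals the net line change from `git diff --numstat`-style output (tab-separated added count, removed count, path, with `-` for binary files). The function name `net_line_delta` and the sample data are hypothetical, made up for illustration.

```python
def net_line_delta(numstat_text: str) -> int:
    """Return added minus removed lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, removed = parts[0], parts[1]
        if added == "-" or removed == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) - int(removed)
    return total


# Hypothetical sample: a refactor that deletes more than it adds.
sample = "12\t40\tsrc/parser.py\n3\t0\tREADME.md\n-\t-\tlogo.png\n"
print(net_line_delta(sample))  # -25
```

A negative result means the commit shrank the codebase; whether that made it better still needs a human (or at least a test suite) to judge.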
> I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.<p>Generating elegant code under more restrictions takes more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
I think it's a rebuttal against claims that LLMs are incompatible with large code bases. It's not so much a flex about the quality of the code, it's more a flex about the complexity of the code and the LLM's ability to deal with such complexity.<p>Whether or not that complexity is warranted is a different story.<p>The codebase may be bloated by a factor of 10, but if the costs associated with that are less than the costs of developing the software from a business standpoint, the choice is clear.
Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not.<p>A lot of the focus has been on AI recently.<p>Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working (-ish) software generated by other software. Is that not "software has gotten a lot better"?<p>Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience, but seriously, how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.<p>We're also only about 10 months into models that are powerful enough to potentially make a bigger difference, and we are still figuring out how to use them best.
I personally don't view coding agents making software as "software gotten better". You are comparing a tool and the end result; these are two different things. The agent you use going down and your product going down mean two different things to your customers. I will not deny that we made incredible progress in coding and, hell, even design over the past 3.5 years; this technology is here to stay.<p>That being said, while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance: these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it) knows how much agent code there is. I think it is likely that it's little, since it was working fine prior to 2023. I could bring up GitHub reliability as an example, given how much Copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of various AI-powered SaaS products that look like they are in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our ever-shrinking understanding of what we, or more so the model, created, then it is still progress overall. But this is progress not toward human-AI collaboration but toward AI-only engineering IMO, which is good or bad depending on how you view the future.<p>I'm a scientist and most of the code I currently write is somewhere on the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before Codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.
> Is Google Maps suddenly taking you the wrong way more often?<p>Funny you mention that, because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to be a dead end. My guess is that the AI decided that this is a shorter road? A less busy road?<p>That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler - something you almost never look at and certainly won't try to touch by hand.<p>The actual "code" is everything driving the harness.<p>The current problem for this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
I think the telling part is in this line:<p>> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility<p>I asked the question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing, so millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens, so you fork over more money.<p>I will say, however, that IMHO there is objectively bad and good code in terms of what it can do and how it performs. If I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model: smaller context usage, and a better approach that informs downstream generation.
This is the part I think we will see become more relevant.<p>I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.<p><a href="https://github.com/ArtRichards/docs-cli" rel="nofollow">https://github.com/ArtRichards/docs-cli</a><p><a href="https://artrichards.github.io/agent-playbook-suite/blog/" rel="nofollow">https://artrichards.github.io/agent-playbook-suite/blog/</a>
Well, to be fair, the amount of goalpost shifting that is going on is quite intense. AI not being able to work in a "serious" project, and being limited to "toy projects" has been a long standing critique.<p>But also, bigger projects need some amount of loc written and it's a bit silly to pretend that this is not the case or a bad thing.<p>So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
>I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.<p>I have also grown suspicious that token usage is inflated in order to run up my bill! But since I feel like it takes me MORE effort to write FEWER lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in FEWER lines of code.
The latest frontier models will write code better and more elegantly than you, with fewer lines of code, in a 100th of the time, with full test coverage. Hand coding is like writing out assembly/machine code rather than using a compiler.
> It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality.<p>The simple answer is that promoting locs as a relevant metric is also reward hacking. Is it easier to promote big loc counts as a key metric, or is it easier to prove agentic engineering against harder metrics?<p>On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
It's not a massive amount of code for the sake of it, it's that it built a complex, large piece of software itself. If you compared the Linux kernel LOC to some guy's toy kernel on GitHub, the LOC difference should give you an idea of the complexity. "Yeah, why is Tolstoy flexing about the number of pages in his book, I wrote Spot goes to school and it's only 20 pages"
Yeah, so all of personal computing (text editing, SVG antialiasing, etc.) fits in 20,000 LOC (VPRI's STEPS project), so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title, and LLMs had to ossify some pre-existing voice as the forever programming <i>habitus</i>, and it chose a persona that doesn't know how to program, because we don't, and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission” and the system prompt says “keep it short, stupid” and graybeards might be our last bulwark.
I've heard it said that measuring productivity of a software developer by lines of code added, is akin to measuring the productivity of an aerospace engineer by mass added.<p>It is a metric. It is often not a good metric. But it is easy to measure.
People want to do X, so the metric is how much X can be done.<p>Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.<p>Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
It is a very valid question. My intuition (no grounding) is that it comes down to model training. Optimizations have traditionally worked well in human-written software thanks to the developer's experience, the use of architectural patterns, or a second or third pass of fine tuning. In the case of model-written code (output one token at a time), the only possible architectural optimization is either a strict guardrail on which patterns to use for a specific implementation OR a second or third optimization pass. All of which burns more tokens, but can lead to better software.
I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity, the number of lines is just the proxy to that. This article is a marketing piece after all.<p>Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Don't compare human loc with machine loc.<p>Compare machine to machine (as these headlines come) and discount that by a factor.
You can't really do that here, because one of the key arguments for this, and the one people in the thread focus on, is the "1/10th of the time" estimate. The comparison with humans is already here, albeit it is just an estimate and no actual comparison has been done.<p>This is a problem of conflicting incentives that exists today, in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this, where it is clear that the downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by the agent in the article, as you've said. But in my mind, in at least the near future, there is a case where your prod will break and you won't be able to understand it or the attempted fixes. Maybe the agent will fail to fix it at all and start a massive rewrite. In any case, is this different from kicking technical debt down the road along with worse interpretability of what you have built?<p>I do think there is a way where an agent can write great solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement, so it might take more than 1/10th of the time to produce something good.
Because sloppy code doesn't matter. He wrote he completed it in 1/10th of the time. That's equivalent to 1/10th the engineers.<p>That is a business win. That is really all that matters in capitalism.<p>The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.<p>So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who say things like it is. Yes, my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.<p>Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.<p>It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
I wish these breathless blog posts would actually try to be more didactic.<p>For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.<p>I’m not an AI skeptic. Rather, I don’t want to miss out on any actual super powers.
I do quite a lot of what this post describes in a reasonably large project. Here's what works for me:<p>- write Gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.<p>- use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.<p>- make a VitePress subsite in your repo, have the agents maintain it - document important principles, architecture, etc.<p>- make a CLI command which lists all pages along with the YAML frontmatter description so agents can choose what to read without blowing up the context window.<p>- use DDD and a monorepo - write your logic in headless layers, and compose layers into apps. Agents navigate layers very successfully.<p>- use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use oRPC<p>- make a single skill called "code" which describes the lifecycle: open a worktree, set up .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update the feature file (this is where you negotiate the spec), implement, validate (e.g. using Playwright MCP), pre-push checks, push and wait for review, tear down and fast forward main<p>- testcontainers is great for ensuring multiple agents can run tests that don't conflict<p>Seriously, I only have one skill, that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense, not a LoC sense.
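For anyone wondering what the "CLI command which lists all pages along with the frontmatter description" could look like, here is a minimal sketch, not the parent's actual tool. It assumes a `docs/` directory of Markdown files whose frontmatter sits between `---` lines and has a single-line `description:` key; the function names and the directory layout are hypothetical, and it uses a naive line parser rather than a real YAML library.

```python
import sys
from pathlib import Path


def read_description(markdown_text: str) -> str:
    """Return the `description` value from a leading frontmatter block, or ''."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""  # no frontmatter at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("\"'")
    return ""


def list_pages(docs_root: str) -> list[tuple[str, str]]:
    """Return (relative path, description) for every Markdown page under docs_root."""
    root = Path(docs_root)
    if not root.is_dir():
        return []
    pages = []
    for path in sorted(root.rglob("*.md")):
        description = read_description(path.read_text(encoding="utf-8"))
        pages.append((path.relative_to(root).as_posix(), description))
    return pages


if __name__ == "__main__":
    # Hypothetical usage: python list_docs.py docs
    for page, description in list_pages(sys.argv[1] if len(sys.argv) > 1 else "docs"):
        print(f"{page}\t{description}")
```

The point of the design is that an agent reads one short index line per page and only opens the pages whose description matches its task, instead of loading the whole docs tree into context.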
I agree. I followed this article for a repo I'm working on, and I had a very hard time inferring how, specifically, they implemented "providers" and enforced import layers. A sample repo would've been nice.
> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.<p>That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what's the typical number of PRs engineers were expected to push? Maybe 2-10?<p>Do people feel the software has gotten better in the last 6 months? The number of engs is prob the same so we should expect maybe 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.
Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years.<p>You end up with about 2.5 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions.<p>Here, we have 1500 PRs and 1M LOC, which is about 670 added LOC per PR. Remember, not 670 lines total in the PR, but a +670 balance after additions minus removals.<p>Fun questions for attentive readers:<p>- What does a project growing at a rate of one full Firefox codebase's worth of LOC per year look like, a decade down the line?<p>- What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?<p>- Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?<p>- If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
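The back-of-the-envelope comparison above can be reproduced in a few lines. This is only a sketch using the round figures quoted in the thread (2.5M LOC over roughly 1M commits for Firefox, 1M LOC over 1,500 PRs for the agent-built repo); a commit and a PR are not the same unit, so the ratio is indicative at best. The helper name `lines_per_change` is made up for illustration.

```python
def lines_per_change(total_loc: int, changes: int) -> float:
    """Average net lines of code contributed per change (commit or PR)."""
    return total_loc / changes


firefox = lines_per_change(2_500_000, 1_000_000)  # figures quoted in the thread
agent_repo = lines_per_change(1_000_000, 1_500)   # figures quoted in the article

print(f"Firefox: {firefox:.1f} net LOC per commit")      # Firefox: 2.5 net LOC per commit
print(f"Agent repo: {agent_repo:.0f} net LOC per PR")    # Agent repo: 667 net LOC per PR
print(f"Ratio: {agent_repo / firefox:.0f}x")             # Ratio: 267x
```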
> - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?<p>Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task.<p>Anecdotally, I've found keeping file sizes small has been important for agentic coding not just to maintain human readability, but also for optimizing agent performance, precisely because it limits the amount of incidental context they load while working a problem, because they generally load entire files rather than just parsing the part relevant to their current assignment as a human might. That smaller file size thus reduces input noise and the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length.<p>I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
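One cheap way to act on the "keep files small" habit is a check that flags files over a line budget, which could run in a pre-push hook or CI. This is a minimal sketch; the 400-line threshold, the suffix list, and the function name `oversized_files` are arbitrary assumptions, not anything measured or recommended by the article.

```python
from pathlib import Path


def oversized_files(root: str, max_lines: int = 400, suffixes=(".py", ".ts")):
    """Return (line_count, path) for source files longer than max_lines, largest first."""
    results = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            # Count lines lazily so large files are not read into memory at once.
            with path.open(encoding="utf-8", errors="replace") as handle:
                count = sum(1 for _ in handle)
            if count > max_lines:
                results.append((count, path.as_posix()))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    for count, name in oversized_files("."):
        print(f"{count:6d}  {name}")
```

Line count is only a proxy for how much incidental context an agent loads, so treat the output as a list of files worth a look rather than a hard rule.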
Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn’t have), documentation, developer scripts, tests, etc.? How is the test coverage of Firefox?
When I got to the 1M LOC I involuntarily paused feeling like this must be satire.
They never specified what exactly the product was, without which it's impossible to judge the post.<p>For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?”<p>There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”<p>I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product.<p>Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)<p>We’ll get there if more of us try
I’ve been vibe coding a lot over the past year or so, and I think I’m going to stop. In fact, I sort of want to challenge myself to see whether I can go back to the fork in the road with the old Copilot autocomplete workflow and really maximize that. Be in the driver’s seat for most of the code being written, but find ways to use AI to really enhance the flow state / remove blockers. Tools only, minimal actual code generation.
I would be very impressed with someone who's been vibecoding "a lot" for about a year who could then go back to being fully in the loop for even 50%. I would even say I'd expect withdrawal symptoms at that point.<p>The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more value, less delusion.
> ended up being a million lines of code<p>This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLMs clean it up".<p>You almost never need a million lines of code - this includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle and maintain.
Even the linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.
As a point of reference, 1MLOC is about the size of the <i>entire</i> Python standard library <i>including tests</i>, as well as stuff like IDLE. (Well, the Python part of the code. There's about half that much again of C in Modules/ .)
Yeah I cannot see how "we shipped 1 million lines of code in three weeks" is... something to be proud of haha
They directly address routine code cleanup and regularly paying down technical debt near the end of the article.
> should expect maybe 5x faster cycle in major software apps<p>To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
>GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.<p>What if AI lets you create new versions of those tools, but without the enshittification?<p>I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
Oh I definitely agree that AI can and will help create great software.<p>It's just that creating great software isn't really the SV/VC/big tech business model or main goal.
> What if AI lets you create new versions of those tools, but without the enshittification?<p>I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
It really depends on the use-case. For example, my most starred github repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). Github depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via API from the github source. Maps, same story.
AI coders are great for making scrapers, possibly because AI companies use their own tools to make an awful lot of scrapers.
This is a lot tamer than what Claude Code's team claims tbf.
It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases.<p>If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so the code velocity becomes somewhat irrelevant.<p>If you’re solving the right problem and you’ve got a good team, then competitive advantage comes from somewhere OUTSIDE of code velocity.<p>The more important question, I think, is: does faster code yield more value long-term? At the moment, it’s like yeah, we do 3.5 pull requests per day.<p>I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
digression:<p>It's interesting this was submitted to HN over 15 times since it was published in February: <a href="https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fopenai.com%2Findex%2Fharness-engineering&sort=byPopularity&type=story" rel="nofollow">https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...</a><p>But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraged' so maybe that's the secret)
I've been doing the same experiment in tsz[1] for a while now (the same past five months, in fact) and I have come to very similar conclusions. Lots of harness to enforce good architecture splits. Lots of tests and CI.<p>My point of working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think as models get better this direction for making software will eventually make sense. I don't think we're there yet though. But at least, unlike OpenAI's vague claims, I can share the output with you to see!<p>Most of the solutions that offer a very high level of automation, like Lovable, are a bit too optimistic, and their solutions are not tightly coupled with lots of automated testing.<p>[1] <a href="https://github.com/tsz-org/tsz" rel="nofollow">https://github.com/tsz-org/tsz</a>
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouths and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
This might work only if you have “infinite” compute and infinite tokens.<p>As someone who used the $20 plan, this pure agentic approach is impossible to do because I’d hit the limit fast and I would end up with less output.<p>What I found to work incredibly well was to provide human-written code as a reference, and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended)<p>I usually write a stub with a lot of details on how to implement it. Something like a higher-abstraction pseudocode. Then I ask the LLM to implement it.<p>When it fails, it is often better to undo all the changes, adjust the stub so it catches what failed before, and try again.<p>Or, commit the changes, and use a fresh context and only address what went wrong.<p>-<p>Whenever I tried this agentic from-scratch approach, I always ended up disappointed, both with the outcome and with the limit that I hit before an hour had even passed.
This mirrors exactly what I have been doing.<p>- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)<p>- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (<a href="https://github.com/shepherdjerred/monorepo/tree/main/packages/docs" rel="nofollow">https://github.com/shepherdjerred/monorepo/tree/main/package...</a>)<p>- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)<p>- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries<p>I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.
we interviewed Ryan here: <a href="https://www.latent.space/p/harness-eng" rel="nofollow">https://www.latent.space/p/harness-eng</a><p>and he gave a talk version of it in london: <a href="https://www.youtube.com/watch?v=am_oeAoUhew" rel="nofollow">https://www.youtube.com/watch?v=am_oeAoUhew</a>
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel new approaches to existing problems?
You tell it to update the docs, not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that effect. And it works remarkably well. I also have an append-only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS and updating it is remarkably smooth. Runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult.<p><a href="https://github.com/ESikich/smallos" rel="nofollow">https://github.com/ESikich/smallos</a>
Someone else in the comments said to have it make a static website with the info instead with clickable pages and sections so it reads only the content it needs to rather than dumping a long file into context windows. Although I suppose you can have a ToC in the readme too with multiple smaller markdown files as references.
Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, opus starts rehashing the same tasks and finds that they are complete - and sometimes still won’t update those docs unless I ask.<p>Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot
It’s not a self-coding machine. There is a human in the loop; they even added MORE engineers to the team of this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.
I think a lot of people are sleeping on the contents of this article. There are still valuable tidbits I'm going to be applying.
I’d be interested to know two things:<p>1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?<p>2. How much did it cost? Work is being done whilst the engineers sleep, but if that 6-hour overnight task cost $300 and could have been done by a person in 2 hours, is it a real saving?
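A rough way to frame the second question, using only the figures in the comment ($300 for the overnight run, 2 hours for a person). The function name is hypothetical, and the sketch ignores review time and the value of the work happening while people sleep.

```python
def break_even_hourly_rate(agent_cost_usd: float, human_hours: float) -> float:
    """Hourly human cost at which the agent run and the human cost the same."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return agent_cost_usd / human_hours


# Figures from the comment above: a $300 agent run vs. 2 hours of human work.
rate = break_even_hourly_rate(300, 2)
print(f"Break-even fully loaded rate: ${rate:.2f}/hour")
```

Under these assumptions the agent run only saves money if the engineer's fully loaded cost is above $150 per hour.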
This matches my Cursor-based agentic repo almost verbatim.<p>There isn't anything here that I hadn't already experienced and factored into constructs in the repo.<p>I also find that all of the pieces created for an effective agentic engineering project match mainstream engineering best practices perfectly. That has been one of my primary reasons for going all in on agentic engineering; before this, applying best practices was always too costly and conflicted with the team's daily priorities.
Codex updates usually appear every few hours (I am not saying this is how often they're published, but that's my perception as a user). Often I update Codex only to see a new update within an hour or so.<p>Many times those updates are not properly tested; for example, in one update the model selector got completely changed.<p>Then the next hotfix was pushed, which restored the original.
> The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.<p>Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
As other commenters have clarified, it's about layering, separation of concerns etc. Goes by many names. One such terminology here: <a href="https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)" rel="nofollow">https://en.wikipedia.org/wiki/Hexagonal_architecture_(softwa...</a>. DI frameworks use terminology like "Provider": <a href="https://en.wikipedia.org/wiki/Dependency_injection#Injectors" rel="nofollow">https://en.wikipedia.org/wiki/Dependency_injection#Injectors</a>
They're describing a layered architecture enforced by some script in CI.<p>For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.<p>If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.<p>Some languages have this built in like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature.<p>It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.<p>They mention this in the article:<p>> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
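A minimal sketch of that kind of CI check, using the `backend`, `common`, and `frontend` example above. The `ALLOWED` map and `find_violations` function are hypothetical names, and the import graph is assumed to have been extracted already; a real script would build it by parsing source files. This is not the article's implementation.

```python
# Hypothetical rule set: each package lists the packages it may import from.
ALLOWED = {
    "common": set(),         # common depends on nothing
    "backend": {"common"},   # backend may use common only
    "frontend": {"common"},  # frontend may use common only
}


def find_violations(imports, allowed=ALLOWED):
    """Return sorted (importer, imported) pairs that break the rules.

    `imports` maps a package name to the set of packages it imports.
    """
    violations = []
    for importer, targets in imports.items():
        permitted = allowed.get(importer, set())
        for target in targets:
            # A package importing itself is always fine.
            if target != importer and target not in permitted:
                violations.append((importer, target))
    return sorted(violations)


if __name__ == "__main__":
    # Assumed import graph, as a static-analysis pass might report it.
    graph = {
        "backend": {"common"},
        "frontend": {"common", "backend"},  # not allowed
        "common": {"frontend"},             # not allowed
    }
    for importer, target in find_violations(graph):
        print(f"{importer} must not depend on {target}")
    # A CI job would exit non-zero when any violation is found.
```

Keeping the rules in a plain data structure means an agent can read the same file the linter enforces.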
IIUC it's just strict separation of concerns<p>E.g. the UI cannot reach down and directly read config files<p>Configs must only be read by (I'm assuming) a storage interface layer called repo<p>There’s a strict directionality of dependency<p>Somewhat similar to ports and adapters but presumably more strictly enforced by deterministic linters
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward the top 10% of my team. I find it likely that either<p>A. The code is absolute garbage and is speed for speed’s sake
B. They’re using an internal model that is a generation beyond GPT 5.5<p>I say this because we’ve attempted to do something similar using the latest-gen Claude models and a significantly larger team. The code is probably on the order of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed
I think vibe coding is okay for a small internal tool, dashboard, etc. but it’s definitely a no-go for a production running service.
Q1 - How much effort did you put into deterministic guardrails like AST linters, etc.?<p>I find there’s a ton of slop unless hard guardrails are added, e.g. step 1 is just around syntax, step 2 is to enforce mental models<p>You still need someone steering direction who has a logically consistent idea of what you actually want to build<p>Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results<p>For pure distributed infra backend, more investment has to be made in the dev loop to shift the feedback loop left and decouple it from humans or real deploys
I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder.<p>Additionally it’s an internal tool, which is likely much more amenable to slop.
Leveraging a better way. No last mile.<p><a href="https://github.com/space-bacon/SRT" rel="nofollow">https://github.com/space-bacon/SRT</a>
This would be much more convincing if the repos, issue trackers, etc. were accessible.
Published Feb 11, 2026
I started using chatgpt for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit both on unit and against live deployment.
> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop).<p><a href="https://ghuntley.com/loop/" rel="nofollow">https://ghuntley.com/loop/</a>
OK, but you spent 1x tokens to generate, then another 1x to review locally, 1x for the local agent review, and 1x for the cloud review, then ???x until all the bots are satisfied.<p>You end up spending at least 5x the tokens so that a prediction machine can maybe find a discontinuity?<p>I would say a far better approach is 1.123x to generate code + tests + passing analysis tools + human review + 1x "simplify as much as possible", rather than letting the snake eat its own tail without boundaries.
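The arithmetic in this comment can be written out as a small sketch. The function names are hypothetical, and the assumption that each extra loop iteration costs another 1x is mine (the comment leaves it as "???x").

```python
def loop_multiplier(extra_rounds: int, cost_per_round: float = 1.0) -> float:
    """Token multiplier for the review loop described in the quoted article."""
    # generate + local self-review + local agent review + cloud review
    fixed = 1.0 + 1.0 + 1.0 + 1.0
    # Assumption: every further iteration costs `cost_per_round` times 1x.
    return fixed + extra_rounds * cost_per_round


def simpler_multiplier() -> float:
    """Token multiplier for the alternative: one generation pass plus one simplify pass."""
    return 1.123 + 1.0


if __name__ == "__main__":
    for rounds in (1, 3, 5):
        print(f"{rounds} extra round(s): {loop_multiplier(rounds):.1f}x")
    print(f"alternative: {simpler_multiplier():.3f}x")
```

With a single extra round the loop already reaches the 5x mentioned above, against roughly 2.1x for the alternative.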
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.<p>Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on existing, working constructs such as tmux.
Codex pushed an update that made my old threads inaccessible. It takes a million lines to put out a half-baked CRUD app?
Isn't this essentially normal AI usage and what everyone has been doing for 6 months?
"Engineering"<p>These people are so delusional it feels like a mental disease by now.<p>I really hope no one gets hurt in the future by all this slop code from these wannabe engineers.
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?<p>Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make: that the consumers of your work are worse off because you exercised no human judgment in creating it.
Dear OpenAI, the target audience of your blog or at least of this blog post understands English pretty well. Why won't you give them a simple way to disable the shitty ai translation and read the original content? Why translate it at all in the first place?<p>EDIT: found the button, all the way down in the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation
But this is almost what we have been doing for the last 3 to 5 months, isn’t it?
Step 1: Be rich.
why do you have “weeks” to ship what would take “months”?
Title should probably be marked with (February 2026).
The world is now agent-first already?
I would never dare put that in production
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.<p>If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.<p>If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.<p>Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but old fogies who grew up in a different world.<p>Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth.
We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
>a large portion of your skillset is about to become completely worthless<p>I'm not convinced of that.<p>I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.<p>After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
This is the case now - I can explain to the AI that I want to re-factor a component to support different implementations using a strategy pattern, and I can get a similar outcome to what I would have written, just implemented a bit faster. My expertise brings value.<p>But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context.<p>What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively.<p>In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.
How do you keep your skills if you no longer engage in the activity that keeps them sharp?
What is that protest going to get us? We'll convince or force business leaders to not use a cheaper/better tool and protect our jobs? And nobody else in the world is going to pivot either? And our companies will remain competitive?<p>Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop.<p>If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to.<p>I don't have a crystal ball though.
It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when an LLM does a better job immediately and costs less.
That's true now. But in the world of this article, it's also the senior engineers that get nailed. In the world of this article, all code is like what machine code or bytecode is now - it's designed to be used by the machine, not the human, because the expectation is that humans will rarely, if ever, touch it.
Individual voices aren't strong enough to drown out the marketing machine.<p>Artists and writers are unionized, which is why they have a more powerful collective voice.<p>Second, there are enough people whose jobs are too well paid and too cozy for them to dare rock the boat.<p>The economy and job market aren't so hot at the moment either, so people can't quickly jump ship.<p>Can you even be sure that you'll find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't have enough of AI in their mouth.
engineers undervalue their own process<p>artists overvalue their own outputs
I for one am not protesting because I know that this is bullshit marketing nonsense. Look at reliability metrics of OpenAI, they’re terrible. Everyone knew a long way ahead that it’s a scam, now they’re cranking up pricing and trying to rug pull. There will be a lot of developers who will come out very well once the stock tanks. That’s my two cents
> in an agent-first world<p>casual gaslighting
> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.<p>This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.<p>But then I saw it was published in February and OP is just reposting it to farm karma.
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.