Harness engineering: Leveraging Codex in an agent-first world

(openai.com)

204 points by pramodbiligiri 1 day ago

42 comments

yurimo8 hours ago
What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few lines generated as possible, while the secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers. And if I am working on an existing codebase, then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that I wonder if what they have done could have been accomplished on a far smaller LoC budget.
- bjertoref9 minutes ago
 > I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
 Generating elegant code under more restrictions takes more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
- CTDOCodebases58 minutes ago
 I think it's a rebuttal against claims that LLMs are incompatible with large code bases. It's not so much a flex about the quality of the code, it's more a flex about the complexity of the code and the LLM's ability to deal with such complexity. Whether or not that complexity is warranted is a different story. The codebase may be bloated by a factor of 10, but if the costs associated with that are less than the costs of developing the software by hand, then from a business standpoint the choice is clear.
- YZF6 hours ago
 Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not. A lot of the focus has been on AI recently. Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working (-ish) software generated by other software. Is that not "software has gotten a lot better"? Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience, but seriously, how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet. We're also only about 10 months into models that are powerful enough to potentially make a bigger difference, and we are still figuring out how to use them best.
 - yurimo3 hours ago
 I personally don't view coding agents making software as "software gotten better"; you are comparing a tool and the end result, and these are two different things. The agent you use going down and your product going down mean two different things to your customers. I will not deny that we made incredible progress in coding and hell, even design over the past 3.5 years; this technology is here to stay. That being said, while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance: these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it) knows how much agent code there is; I think it is likely that it's little, since it was working fine prior to 2023. I could bring up GitHub reliability as an example, given how much Copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of various AI-powered SaaS products that look like they are in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our ever-shrinking understanding of what we, or more so the model, created, then it is still progress overall. But this is progress not toward human-AI collaboration but toward AI-only engineering IMO, which is good or bad depending on how you view the future. I'm a scientist and most of the code I currently write is somewhere on the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before Codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
 - arcticbull5 hours ago
 I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.
 - csomar1 hour ago
 > Is Google Maps suddenly taking you the wrong way more often?
 Funny you mention that because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to end in a dead end. My guess is that the AI decided that this is a shorter road? A less busy road? That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
- trollbridge7 hours ago
 The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler: something you almost never look at and certainly won't try to touch by hand. The actual "code" is everything driving the harness. The current problem with this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
 - yurimo5 hours ago
 I think the telling part is in this line: "Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility." I asked my question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing, so millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens, so you fork over more money. I will say, however, that IMHO there is objectively bad and good code in terms of what it can do and how it performs. If I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model: smaller context usage, and a better approach that informs downstream generation.
 - ArtRichards4 hours ago
 This is the part I think we will see become more relevant. I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability. https://github.com/ArtRichards/docs-cli https://artrichards.github.io/agent-playbook-suite/blog/
- jstummbillig50 minutes ago
 Well, to be fair, the amount of goalpost shifting that is going on is quite intense. AI not being able to work in a "serious" project, and being limited to "toy projects", has been a long-standing critique. But also, bigger projects need some amount of LOC written, and it's a bit silly to pretend that this is not the case or a bad thing. So the answer to the question is roughly: establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
 - TeriyakiBomb34 minutes ago
 I don’t think it’s solvable. And I think Anthropic etc. know it. LLMs can only reconstitute things in its training data, and they are so hungry that they can’t do a good job in a long-lived codebase full of complexity and novelty. There’s never going to be enough similar code on the open internet.
 - ElFitz10 minutes ago
 > LLMs can only reconstitute things in its training data
 Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date? LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data. Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been voluntarily (and almost entirely) removed from the dataset.
- jh00ker3 hours ago
 > I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
 I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.
 - small_model1 hour ago
 The latest frontier models will write better and more elegant code than you, with fewer lines of code, in a 100th of the time, with full test coverage. Hand coding is like writing out assembly/machine code rather than using a compiler.
- pyrale1 hour ago
 > It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality.
 The simple answer is that promoting LOC as a relevant metric is also reward hacking. Is it easier to promote big LOC counts as a key metric, or is it easier to prove agentic engineering against harder metrics? On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
- small_model1 hour ago
 It's not a massive amount of code for the sake of it; it's that it built a complex, large piece of software itself. If you compared the Linux kernel LOC to some guy's toy kernel on GitHub, the LOC difference should give you an idea of the complexity. "Yeah why is Tolstoy flexing about his number of pages in his book, I wrote Spot goes to school and it's only 20 pages"
- crdrost7 hours ago
 Yeah so all of personal computing (text editing, SVG antialiasing, etc.) fits in 20,000 LOC (VPRI's STEPS project), so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title, and LLMs had to ossify some pre-existing voice as the forever programming habitus, and it chose a persona that doesn't know how to program, because we don't, and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission” and the system prompt says “keep it short, stupid”, and graybeards might be our last bulwark.
 - zahlman6 hours ago
 > VPRI's STEPS project
 The what now? Search engines failed me here.
 - linguae5 hours ago
 This is a description of Alan Kay’s STEPS project: https://worrydream.com/refs/Kay_2007_-_STEPS_2007_Progress_Report.pdf This is the final report: https://tinlizzie.org/VPRIPapers/tr2012001_steps.pdf
- dotancohen4 hours ago
 I've heard it said that measuring the productivity of a software developer by lines of code added is akin to measuring the productivity of an aerospace engineer by mass added. It is a metric. It is often not a good metric. But it is easy to measure.
 - Sparkyte4 hours ago
 Indeed. The more complexity you add, or the more you generate without ensuring it isn't pointless code, the more bloat you get. It is still probably more of an MVP than junk developers will produce, but never anywhere near as good as the work of someone who studies a programming language.
- B-Con4 hours ago
 People want to do X, so the metric is how much X can be done. Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern. Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
- bonigv6 hours ago
 It is a very valid question. My intuition (no grounding) points to the model training. Optimizations have traditionally worked well in human-written software through the experience of the developer, the use of architectural patterns, or a second or third pass of fine tuning. In the case of model-written code (e/p one token at a time), the only possible architectural optimization is either a strict guardrail on which patterns to use for a specific implementation OR a second or third optimization path. All of which burns more tokens, but can lead to better software.
- cpard7 hours ago
 I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity; the number of lines is just the proxy for that. This article is a marketing piece after all. Now someone can argue that lines of code are not a good proxy for engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
- Npovview5 hours ago
 Don't compare human LOC with machine LOC. Compare machine to machine (as these headlines come) and discount that by a factor.
 - yurimo5 hours ago
 You can't really do that here, because one of the key arguments for this, as people in the thread point out, is the "1/10th of the time" estimate. The comparison with humans is here already, albeit it is just an estimate and no actual comparison has been done. This is a problem of conflicting incentives that exists today, in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this, where it is clear that the downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by the agent in the article, as you've said. But in my mind, in at least the near future, there is a case where your prod will break and you won't be able to understand it or the attempted fixes. Maybe the agent will fail to fix it at all and start a massive rewrite. In any case, is this different from kicking technical debt down the road along with worse interpretability of what you have built? I do think there is a way where an agent can write great, solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement, so it might take more than 1/10th of the time to produce something good.
- threethirtytwo6 hours ago
 Because sloppy code doesn't matter. He wrote he completed it in 1/10th the time. That's equivalent to 1/10th the engineers. That is a business win. That is really all that matters in capitalism. The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code. So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who say things like it is. Yes, my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it. Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
 - estetlinus5 hours ago
 The real flex is delivering it in one-tenth of the time. Mentioning lines of code is mostly noise. I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.
- dumbdumb1257 hours ago
 It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM. It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
drivebyhooting7 hours ago
I wish these breathless blog posts would actually try to be more didactic. For example, actually doing a walkthrough of how to set up these allegedly super powered workflows, with concrete demonstrations. I’m not an AI skeptic. Rather, I don’t want to miss out on any actual super powers.
- nimonian7 hours ago
 I do quite a lot of what this post describes in a reasonably large project. Here's what works for me:
    - write gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.
    - use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.
    - make a VitePress subsite in your repo, have the agents maintain it - document important principles, architecture, etc.
    - make a cli command which lists all pages along with the yaml frontmatter description so agents can choose what to read without blowing up the context window.
    - use ddd and monorepo - write your logic in headless layers, and compose layers into apps. agents navigate layers very successfully.
    - use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use orpc
    - make a single skill called "code" which describes the lifecycle: open a worktree, set up .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update feature file (this is where you negotiate the spec), implement, validate (e.g. using playwright mcp), pre-push checks, push and wait for review, tear down and fast forward main
    - testcontainers is great for ensuring multiple agents can run tests that don't conflict
 Seriously, I only have one skill, that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense, not a LoC sense.
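 As a rough illustration of the docs-index command from the list above, here is a minimal sketch. It assumes Markdown pages under one docs folder, each with a one-line `description:` key in its frontmatter. The function names are made up for the example, and a real setup would probably use a proper YAML parser instead of this line-based one.

```python
from pathlib import Path


def read_description(markdown_text: str) -> str:
    """Return the `description` value from a leading frontmatter block, or ""."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""  # no frontmatter at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            # Drop surrounding whitespace and simple quoting.
            return value.strip().strip("\"'")
    return ""


def list_pages(docs_root: str) -> list[tuple[str, str]]:
    """Return (relative path, description) for every Markdown page under docs_root."""
    root = Path(docs_root)
    pages = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        pages.append((path.relative_to(root).as_posix(), read_description(text)))
    return pages
```

 Printing one `path<TAB>description` line per entry of `list_pages("docs")` gives an agent a cheap table of contents to pick from before it opens any page in full.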
 - nullbio6 hours ago
 Can you share your skill please?
 - pramodbiligiri3 hours ago
 I agree with many of the points made by nimonian above (esp the one starting with 'make a single skill called "code" which describes the lifecycle'), based on my limited experience with these things. I'm building a skill + CLI tool along those lines (for solo devs, not corporates). Here is what my "lifecycle" type skill looks like right now: https://github.com/bitkentech/shipsmooth/blob/releases/dist/skills/start/SKILL.md (warning, heavily work in progress). You can see a demo here: https://shipsmooth.net/ I was not happy with the default code quality generated by Claude Code. So I've been adding some skill-file rules to address that, and so far I'm happy with the results: https://github.com/bitkentech/shipsmooth/tree/main/skills/experimental/refine/rules. There was a similar one on HN yesterday called opencodereview: https://news.ycombinator.com/item?id=48406358 There are many such workflows out there! Matt Pocock gave a good talk about how he approaches it: https://www.youtube.com/watch?v=-QFHIoCo-Ko
 - rednb4 hours ago
 That's a big ask. This kind of harness usually contains plenty of proprietary insights about their business. And also, nowadays, a good harness is a major competitive advantage.
 - nullbio3 hours ago
 Good thing I wasn't asking you. Also, a skill is not a harness.
- bze123 hours ago
 I agree. I followed this article for a repo I'm working on, and I had a very hard time inferring how, specifically, they implemented "providers" and enforced import layers. A sample repo would've been nice.
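 For reference, one generic way to enforce import layers, independent of whatever the article's team actually did, is a small check that runs in CI or a pre-push hook. This is only a sketch: the layer names and the rule (a layer may import itself or anything below it) are assumptions made up for the example.

```python
import ast

# Hypothetical layer order, lowest first; a layer may import itself or anything below it.
LAYERS = ["domain", "services", "api"]


def imported_top_level_names(source: str) -> list[str]:
    """Collect the top-level package name of every absolute import in the source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.append(node.module.split(".")[0])
    return names


def layer_violations(source: str, own_layer: str, layers=LAYERS) -> list[str]:
    """Return the imported layers that sit above own_layer in the layer order."""
    own_rank = layers.index(own_layer)
    return [
        name
        for name in imported_top_level_names(source)
        if name in layers and layers.index(name) > own_rank
    ]
```

 Calling `layer_violations` on each file with the layer it lives in, and failing the check on any non-empty result, gives the agent a hard error instead of a convention it can drift away from.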
bko10 hours ago
> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what's the typical number of PRs engineers were expected to push? Maybe 2-10? Do people feel the software has gotten better in the last 6 months? The number of engineers is probably the same, so we should expect maybe 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.
- torben-friis10 hours ago
 Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years. You end up with about 3 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions. Here, we have 1500 PRs and 1M LOC, which is about 650 added LOC per PR. Remember, not 650 lines total in the PR, but a +650 balance after additions minus removals. Fun questions for attentive readers:
    - What does a project growing at a rate of one full Firefox codebase worth of LOC per year look like, a decade down the line?
    - What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?
    - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
    - If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
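 The back-of-envelope numbers above can be reproduced directly. The inputs are the round figures quoted in this thread, not measured values, and the helper names are made up for illustration.

```python
def net_lines_per_change(total_loc: int, changes: int) -> float:
    """Average net lines a codebase grew per commit or PR."""
    return total_loc / changes


def annualized_loc(total_loc: int, months: int) -> float:
    """Extrapolate growth over `months` to a 12-month rate."""
    return total_loc * 12 / months


# Round figures quoted in the thread.
firefox_per_commit = net_lines_per_change(2_500_000, 1_000_000)
project_per_pr = net_lines_per_change(1_000_000, 1_500)
project_per_year = annualized_loc(1_000_000, 5)

print(f"Firefox: {firefox_per_commit:.1f} net lines per commit")
print(f"Project: {project_per_pr:.0f} net lines per PR")
print(f"Project: {project_per_year:,.0f} lines per year at the same pace")
```

 With these inputs the per-PR figure comes out nearer 667 than 650, and the annualized rate is about 2.4M lines, which is where the "one Firefox per year" comparison comes from.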
 - stult5 hours ago
 > - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
 Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task. Anecdotally, I've found keeping file sizes small has been important for agentic coding not just to maintain human readability, but also for optimizing agent performance, precisely because it limits the amount of incidental context they load while working a problem, since they generally load entire files rather than just parsing the part relevant to their current assignment as a human might. That smaller file size thus reduces input noise and the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length. I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small, because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
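 A minimal sketch of a check that keeps file sizes in view, assuming an arbitrary 400-line threshold and a made-up `oversized_files` helper; the right limit and the set of file suffixes depend on the project.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def oversized_files(root, max_lines=400, suffixes=(".py", ".ts", ".go")):
    """Return (line count, relative path) for source files above max_lines, largest first."""
    root_path = Path(root)
    report = []
    for path in root_path.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            line_count = count_lines(path)
            if line_count > max_lines:
                report.append((line_count, path.relative_to(root_path).as_posix()))
    return sorted(report, reverse=True)
```

 Run as a pre-push or CI step, a non-empty report is a prompt to split a file before agents start loading it wholesale.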
 - therealdrag07 hours ago
 Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn’t have), documentation, developer scripts, tests, etc.? How is the test coverage of Firefox?
 - CleanCoder8 hours ago
 When I got to the 1M LOC I involuntarily paused feeling like this must be satire.
- krackers10 hours ago
 They never specified what exactly the product was, without which it's impossible to judge the post. For some reason most of the uses of "agents" are to build yet other AI products; it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
 - theptip8 hours ago
 There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?” There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward?” I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
 - becomevocal9 hours ago
 Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product. Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics). We’ll get there if more of us try.
- techblueberry8 hours ago
 I’ve been vibe coding a lot over the past year or so, and I think I’m going to stop. In fact, I sort of want to challenge myself: can I go back to the fork in the road, to the old copilot autocomplete workflow, and really maximize that? Be in the driver’s seat for most of the code being written, but find ways to use AI to really enhance the flow state / remove blockers. Tools only, with minimal actual code generation.
 - 59nadir3 hours ago
 I would be very impressed with someone who's been vibecoding "a lot" for about a year who could then go back to being fully in the loop for even 50%. I would even say I'd expect withdrawal symptoms at that point. The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more value, less delusion.
- Aperocky10 hours ago
 > ended up being a million lines of code
 This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLMs clean it up". You almost never need a million lines of code, and that includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle and maintain.
 - bombcar9 hours ago
 Even the linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.
 - Chu4eeno9 hours ago
 And Linux maintainers are actively pushing to radically cut down on the LOC by eliminating drivers etc.
 - zahlman5 hours ago
 As a point of reference, 1MLOC is about the size of the entire Python standard library including tests, as well as stuff like IDLE. (Well, the Python part of the code. There's about half that much again of C in Modules/ .)
 - girvo10 hours ago
 Yeah I cannot see how "we shipped 1 million lines of code in three weeks" is... something to be proud of haha
 - faustin8 hours ago
 They directly address routine code cleanup and regularly paying down technical debt near the end of the article.
 - Aperocky5 hours ago
 I stand corrected, but the LOC being advertised still makes me doubt the efficacy of their process.
- aleqs9 hours ago
 > should expect maybe 5x faster cycle in major software apps
 To what end, and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly (GitHub, Spotify, Google Maps, just to name a few) have gotten noticeably shittier in recent times.
 - linsomniac8 hours ago
 > GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
 What if AI lets you create new versions of those tools, but without the enshittification? I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
 - aleqs8 hours ago
 Oh I definitely agree that AI can and will help create great software. It's just that creating great software isn't really the SV/VC/big tech business model or main goal.
 - NoraCodes8 hours ago
 > What if AI lets you create new versions of those tools, but without the enshittification?
 I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
 - linsomniac7 hours ago
 It really depends on the use-case. For example, my most starred github repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). Github depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via API from the github source. Maps, same story.
 - nostrademons8 hours ago
 AI coders are great for making scrapers, possibly because AI companies use their own tools to make an awful lot of scrapers.
- dchftcs8 hours ago
 This is a lot tamer than what Claude Code's team claims tbf.
- ai-roundup8 hours ago
 [dead]
- jakolaptu8 hours ago
 It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases. If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so the code velocity becomes somewhat irrelevant. If you’re solving the right problem and you’ve got a good team, then competitive advantage comes from somewhere OUTSIDE of code velocity. The more important question, I think, is: does faster code yield more value long-term? At the moment, it’s like "yeah, we do 3.5 pull requests per day". I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
varenc9 hours ago
digression: It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fopenai.com%2Findex%2Fharness-engineering&sort=byPopularity&type=story
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraging' so maybe that's the secret)
- swyx4 hours ago
 time of day also matters
mohsen12 hours ago
I've been doing the same experiment in tsz[1] for a while now (the same past five months, in fact) and I have come to very similar conclusions. Lots of harness to enforce good architecture splits. Lots of tests and CI. My point of working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think as models get better this direction for making software will eventually make sense. I don't think we're there yet though. But at least, unlike OpenAI's vague claims, I can share the output with you to see! Most of the solutions that offer a very high level of automation, like Lovable, are a bit too optimistic, and their solutions are not tightly coupled with lots of automated testing.
[1] https://github.com/tsz-org/tsz
murat1249 hours ago
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouth and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
- prakashn277 hours ago
  Very true. If a PR has 1000 lines I would check only a handful of them and leave the rest for the test suite.
thelucent7 hours ago
This might work only if you have “infinite” compute and infinite tokens. As someone who used the $20 plan, I find this pure agentic approach impossible because I’d hit the limit fast and end up with less outcome. What I found works incredibly well is to provide human-written code as reference, and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended). I usually write a stub with a lot of details on how to implement it, something like higher-abstraction pseudocode, then ask the LLM to implement it. When it fails, it is often better to undo all the changes, adjust the stub so it catches what failed before, and try again. Or commit the changes, start a fresh context, and only address what went wrong.
Whenever I tried this agentic from-scratch approach, I always ended up disappointed, both with the outcome and with the limit that I hit before an hour had even passed.
- zuzululu5 hours ago
 You are not going anywhere with the $20 plan. Upgrade to $200/month and you should see more usage, but even then, speaking as a hardcore user, one can never have enough. I'm still very jealous of those guys that got 200x usage simply by RSVP'ing to the OpenAI party.
shepherdjerred8 hours ago
This mirrors exactly what I have been doing.
- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)
- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/packages/docs)
- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)
- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries
I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
- para_parolu8 hours ago
 Does it yield good results? I found that instead of docs it’s easier just to ask the AI to read the code. I feel like this is the same as comments in code: they become outdated fast.
 - shepherdjerred8 hours ago
 I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why. I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself. https://github.com/shepherdjerred/monorepo/tree/main/packages/docs/logs
 I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
 - c0rruptbytes53 minutes ago
 it does not give great results left unattended, it’ll start creating slop or hardcoding solutions
 but over time, if you adjust your verification rubric, it’s not too bad, gets pretty good. if you do make it do TDD, it gets kinda crazy and you’ll have 2000-3000 tests after a while, or in my common case, 6000-7000 lines of code in single files (i usually have a cron to audit files for decomposition and create tickets)
 i wouldn’t use it at my job yet, but it’s been fun to use for personal projects - it’s like modded minecraft automation or factorio
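 The cron audit mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions: the function name `oversized_files`, the 1000-line threshold, and the file suffixes are all hypothetical, and the ticket-creation step is left out entirely.

```python
from pathlib import Path


def oversized_files(root, max_lines=1000, suffixes=(".py", ".ts")):
    """Return (path, line_count) pairs for source files longer than max_lines.

    Hypothetical helper: the threshold and suffixes are illustrative defaults,
    not values taken from the comment above.
    """
    results = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            # Count lines lazily so very large files are not loaded at once.
            with path.open(encoding="utf-8", errors="ignore") as handle:
                count = sum(1 for _ in handle)
            if count > max_lines:
                results.append((str(path), count))
    # Largest files first, since those are the best decomposition candidates.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, count in oversized_files("."):
        print(f"{count:>7}  {name}")
```

 A scheduled job could run this and hand the resulting list to an agent or an issue tracker; how tickets get created depends on tooling the comment does not describe.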
h4ny5 hours ago
I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.
swyx4 hours ago
we interviewed Ryan here: https://www.latent.space/p/harness-eng
and he gave a talk version of it in london: https://www.youtube.com/watch?v=am_oeAoUhew
zatkin9 hours ago
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel new approaches to existing problems?
- esikich8 hours ago
 You tell it to update the docs: not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that extent. And it works remarkably well. I also have an append only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS and updating it is remarkably smooth. Runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult. https://github.com/ESikich/smallos
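 One way to keep an agent honest about an append-only change log is a small pre-push check. The sketch below is an assumption-laden illustration, not the commenter's setup: `changelog_is_append_only` is a hypothetical name, and it simply compares the previous text of the log with the new text.

```python
def changelog_is_append_only(previous, current):
    """Return True if `current` keeps all of `previous` and only adds text after it.

    Hypothetical check: any edit or deletion inside the existing log makes
    the new text stop starting with the old text, so it is rejected.
    """
    return current.startswith(previous)


if __name__ == "__main__":
    old = "2026-01-01 add FTP server\n"
    appended = old + "2026-01-02 add HTTP server\n"
    rewritten = "2026-01-01 add FTP daemon\n2026-01-02 add HTTP server\n"
    print(changelog_is_append_only(old, appended))   # entry added at the end
    print(changelog_is_append_only(old, rewritten))  # history was edited
```

 The docs themselves are meant to be rewritten freely, so a check like this would only apply to the log file.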
 - satvikpendem6 hours ago
 Someone else in the comments said to have it make a static website with the info instead with clickable pages and sections so it reads only the content it needs to rather than dumping a long file into context windows. Although I suppose you can have a ToC in the readme too with multiple smaller markdown files as references.
 - vibcdingenjoyer8 hours ago
 Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, opus starts rehashing the same tasks and finds that they are complete - and sometimes still won’t update those docs unless I ask. Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot.
- therealdrag07 hours ago
 It’s not a self-coding machine. There is a human in the loop; they even added MORE engineers to the team of this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.
zuzululu5 hours ago
I think a lot of people are sleeping on the contents of this article. There are still valuable tidbits I'm going to be applying.
simonbarker873 hours ago
I’d be interested to know two things:
1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
2. How much did it cost? Work is being done whilst the engineers sleep, but if that 6-hour overnight task cost $300 and could have been done by a person in 2 hours, is it a real saving?
bigcat123456784 hours ago
This matches my Cursor-based agentic repo almost verbatim. There isn't anything here that was not already experienced and factored into constructs in the repo. And I also find that all of the bits created for an effective agentic engineering project match perfectly with mainstream engineering best practices. That has been one of my primary reasons to go all in on agentic engineering; prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.
faangguyindia9 hours ago
Codex updates usually appear every few hours (I am not saying this is how often they're published, but that's my perception as a user). Often I update Codex just to see a new update within an hour or so. Many times those updates are not properly tested; for example, in one update the model selector got completely changed, then a hotfix was pushed which restored the original.
- dawnerd9 hours ago
 Who needs a QA team when you can just test on users and iterate instantly /s
ajpaulson6 hours ago
> The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
- pramodbiligiri2 hours ago
 As other commenters have clarified, it's about layering, separation of concerns etc. Goes by many names. One such terminology is described at https://en.wikipedia.org/wiki/Hexagonal_architecture_(software) and DI frameworks use terminology like "Provider": https://en.wikipedia.org/wiki/Dependency_injection#Injectors
- shepherdjerred6 hours ago
 They're describing a layered architecture enforced by some script in CI. For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other. If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules. Some languages have this built in, like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature. It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents. They mention this in the article:
 > This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
- iso13376 hours ago
 IIUC it's just strict separation of concerns. E.g. the UI cannot reach down and directly read config files. Configs must only be read by (I'm assuming) a storage interface layer called repo. There’s a strict directionality of dependency. Somewhat similar to ports and adapters, but presumably more strictly enforced by deterministic linters.
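 For anyone who wants to see the rule from the quoted passage in concrete form, here is a minimal sketch of a mechanical layering check. It is a hypothetical illustration, not OpenAI's tooling: the layer names come from the quote, while `LAYERS`, `find_violations`, the (importer, imported) edge format, and the reading of "forward" (a layer may import from its own layer or any earlier one, never a later one) are assumptions.

```python
# Layer order taken from the quoted article; everything else is assumed.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def find_violations(edges):
    """Return the (importer, imported) pairs that break the layering rule.

    Assumed interpretation: code in a layer may import from the same layer
    or from a layer that appears earlier in LAYERS, but never a later one.
    """
    violations = []
    for importer, imported in edges:
        if RANK[imported] > RANK[importer]:
            violations.append((importer, imported))
    return violations


if __name__ == "__main__":
    # Hypothetical import edges, e.g. extracted by a linter from source files.
    edges = [("service", "repo"), ("ui", "runtime"), ("config", "ui")]
    for importer, imported in find_violations(edges):
        print(f"disallowed: {importer} -> {imported}")
```

 A real check would first have to extract the import edges from source files and would also need a rule for the Providers interface, which this sketch does not model.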
charintstr8 hours ago
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward the top 10% of my team. I find it likely that either
A. The code is absolute garbage and is speed for speed's sake
B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed.
- sajithdilshan2 hours ago
 I think vibe coding is okay for a small internal tool, dashboard, etc. but it’s definitely a no-go for a production running service.
- iso13376 hours ago
 Q1 - How much effort did you put into deterministic guardrails like AST linters, etc? I find there’s a ton of slop unless hard guardrails are added, e.g. step 1 is just around syntax, step 2 is to enforce mental models. You still need someone steering direction who has a logically consistent idea of what you actually want to build.
 Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results. For pure distributed infra backend, more investment has to be made in the dev loop to shift the feedback loop left and decouple it from humans or real deploys.
- therealdrag07 hours ago
 I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder. Additionally it’s an internal tool, which is likely much more amenable to slop.
spacebacon3 hours ago
Leveraging a better way. No last mile. https://github.com/space-bacon/SRT
jonmoore8 hours ago
This would be much more convincing if the repos, issue trackers, etc. were accessible.
angrydev9 hours ago
Published Feb 11, 2026
- ukuina8 hours ago
  Might as well be 2025.
Frannky7 hours ago
I started using chatgpt for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit both on unit and against live deployment.
andai8 hours ago
> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop).
https://ghuntley.com/loop/
- aniceperson2 hours ago
 ok, but you spent 1x tokens to generate, then 1x more to review locally, 1x for the local agent, 1x for the cloud, then ???x until all bots are satisfied. You end up spending at least 5x the amount of tokens for maybe a prediction machine to find a discontinuity? I would say a way better approach is 1.123x to generate code + tests + passing analysis tools + human review + 1x "simplify as much as possible", than letting the snake eat its own tail without boundaries.
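 The arithmetic in this comment can be written out as a tiny sketch. The multipliers are the commenter's own rough figures rather than measurements, and the function names and the `extra_rounds` parameter (standing in for the unknown "???x") are assumptions made for illustration.

```python
def review_loop_cost(extra_rounds):
    """Token cost, in multiples of one generation pass, for the agent review loop.

    1x generate + 1x local self-review + 1x local agent review + 1x cloud review,
    plus `extra_rounds` more passes until every reviewer is satisfied.
    """
    return 1.0 + 1.0 + 1.0 + 1.0 + extra_rounds


def single_pass_cost():
    """Token cost of the alternative: 1.123x generation plus a 1x simplify pass."""
    return 1.123 + 1.0


if __name__ == "__main__":
    loop = review_loop_cost(extra_rounds=1)  # a single extra round reaches 5x
    alt = single_pass_cost()
    print(f"loop: {loop:.3f}x, single pass: {alt:.3f}x, ratio: {loop / alt:.2f}")
```

 Human review time is not counted in either figure, so this only compares token spend.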
Aperocky8 hours ago
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness. Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on working, existing constructs such as tmux.
darepublic10 hours ago
Codex pushed an update that made my old threads inaccessible. This takes a million lines to put out a half-baked CRUD app?
mgaunard5 hours ago
Isn't this essentially normal AI usage and what everyone has been doing for 6 months?
witx1 hour ago
"Engineering"
These people are so delusional it feels like a mental disease by now. I really hope no one gets hurt by all this slop code in the future by these wannabe engineers.
rfw30010 hours ago
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand? Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make: that the consumers of your work are worse off because you exercised no human judgment in creating it.
aulin4 hours ago
Dear OpenAI, the target audience of your blog, or at least of this blog post, understands English pretty well. Why won't you give them a simple way to disable the shitty AI translation and read the original content? Why translate it at all in the first place?
EDIT: found the button, all the way down at the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation
drchaim10 hours ago
But this is almost what we have been doing for the last 3 to 5 months, isn’t it?
- wilsonnb310 hours ago
  Article is from February so that tracks
- fbrncci9 hours ago
  Well to a lot of people this is still a foreign concept.
nullbio3 hours ago
Step 1: Be rich.
bronny19899 hours ago
why do you have “weeks” to ship what would take “months”?
- andai8 hours ago
  > We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude.
  - robotresearcher8 hours ago
    I guess orders of magnitude ain’t what they used to be.
IAmGraydon7 hours ago
Title should probably be marked with (February 2026).
shevy-java4 hours ago
The world is now agent-first already?
Sarkie10 hours ago
I would never dare put that in production
apical_dendrite8 hours ago
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel. If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective, is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them. If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements. Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but as old fogies who grew up in a different world. Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth.
We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
- linsomniac8 hours ago
 > a large portion of your skillset is about to become completely worthless
 I'm not convinced of that. I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class. After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
 - apical_dendrite5 hours ago
 This is the case now - I can explain to the AI that I want to re-factor a component to support different implementations using a strategy pattern, and I can get a similar outcome to what I would have written, just implemented a bit faster. My expertise brings value. But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context. What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively. In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.
 - jplusequalt6 hours ago
 How do you keep your skills if you no longer engage in the activity that keeps them sharp?
- YZF6 hours ago
 What is that protest going to get us? We'll convince or force business leaders to not use a cheaper/better tool and protect our jobs? And nobody else in the world is going to pivot either? And our companies will remain competitive? Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop. If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to. I don't have a crystal ball though.
- briHass8 hours ago
 It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when a LLM does a better job immediately and costs less.
 - apical_dendrite5 hours ago
 That's true now. But in the world of this article, it's also the senior engineers that get nailed. In the world of this article, all code is like what machine code or bytecode is now - it's designed to be used by the machine, not the human, because the expectation is that humans will rarely, if ever, touch it.
- mhitza5 hours ago
 Individual voices aren't strong enough to drown out the marketing machine. Artists and writers are unionized, which is why they have a more powerful collective voice. Second, there are enough people whose jobs are very well paid and too cozy for them to dare to rock the boat. The economy and job market aren't hot enough at the moment for people to be able to jump ship quickly, either. Can you even be sure that you'll find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't have enough of AI in their mouth.
- zuzululu5 hours ago
 engineers undervalue their own process
 artists overvalue their own outputs
- usernametaken295 hours ago
 I for one am not protesting because I know that this is bullshit marketing nonsense. Look at reliability metrics of OpenAI, they’re terrible. Everyone knew a long way ahead that it’s a scam, now they’re cranking up pricing and trying to rug pull. There will be a lot of developers who will come out very well once the stock tanks. That’s my two cents
Yokohiii7 hours ago
> in an agent-first world
casual gaslighting
EnPissant8 hours ago
> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind-blowing. But then I saw it was published in February and OP is just reposting it to farm karma.
knicholes10 hours ago
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.