You didn't reduce e2e test time. You reduced e2e test coverage by only running the tests the LLM tells you to run.
Reminds me of that time when I was doing a POC on an existing code base and reduced build time by deleting all the unit tests
+1<p>At Meta we use similar heuristics on which tests to run per PR. When the system gets it wrong, which is often, it's painful and leads to merged code that broke tests that were never run
FWIW I feel like you could still do both, no? I.e. run all tests daily or at whatever cadence is not painful but still protective, but _also_ use heuristics to improve PR responsiveness.<p>Like anything, it's about tradeoffs. Though if it were me, I'd simply write a mechanism which deterministically decides which areas of code pertain to which tests and use that simple program to determine which tests run. The algorithm could be loosely as simple as things like code owners and git-blame, but relative to some big Code->Test map that you can have Claude/etc build ahead of time. The difference being it's deterministic between PRs and can be audited by humans for anything obviously poor or w/e.<p>As much as LLMs are interesting to me, I hate using them in places where I want consistency. CI seems terrible for them.. to me at least.
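Roughly what I have in mind, as a sketch; the mapping contents, paths, and test names are made up, and the Code->Test map itself is the thing you'd generate ahead of time and audit:<p><pre><code>
import fnmatch
import subprocess

# Assumed mapping; patterns and test names are hypothetical and would be
# generated once (by a human, Claude, etc.) and reviewed, not guessed in CI.
CODE_TO_TESTS = {
    "src/checkout/*": ["e2e/checkout_flow", "e2e/payment"],
    "src/auth/*":     ["e2e/login", "e2e/signup"],
    "config/*":       ["e2e/smoke"],
}

def changed_files(base: str = "origin/main") -> list[str]:
    # Files touched by the PR relative to the target branch.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def tests_to_run(files: list[str]) -> set[str]:
    selected: set[str] = set()
    for path in files:
        matched = False
        for pattern, tests in CODE_TO_TESTS.items():
            if fnmatch.fnmatch(path, pattern):
                selected.update(tests)
                matched = True
        if not matched:
            # Unmapped file: fail safe and fall back to the full suite.
            selected.add("e2e/full_suite")
    return selected

if __name__ == "__main__":
    print(sorted(tests_to_run(changed_files())))
</code></pre>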
> When the system gets it wrong, which is often<p>Which suggests there might be big room for improvement. If LLMs open up new ways to be more precise about which E2E to run, that is a significant benefit, because you’d manage risk but reduce the time and resources it takes.
Why do you respond to the title when that's addressed in the opening?
No, it's not. The key concept here is the reduction of <i>coverage</i>.<p>The opening part of the post makes certain claims -- "catch bugs before they ship" and "keep our master branch always clean" -- that can only be true if the coverage is not reduced. But, as other replies point out, if the heuristic for test selection has false negatives, the coverage is effectively reduced, which can result in bugs.<p>That problem can be bad enough when the heuristic is deterministic. AI throws determinism out of the window, which only exacerbates the problem.
Yeah I'm not seeing any evidence that this actually works. I would've liked to see some testing where they intentionally introduce a bug (ideally a tricky bug in a part of the code that isn't directly changed by the diff) and see if Claude catches it.<p>A good middle ground could be to allow the diff to land once the "AI quick check" passed, then keep the full test suite running in the background. If they run them side by side for a while and see that the AI quick check caught the failing test every time, I'd be convinced.
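The side-by-side check could be as simple as diffing the AI-selected list against the full-suite results, assuming both get dumped somewhere per PR; the file names and report format below are hypothetical:<p><pre><code>
import json

def missed_failures(selected_path: str, full_report_path: str) -> set[str]:
    # Tests the AI quick check chose to run, e.g. ["e2e/login", ...]
    with open(selected_path) as f:
        selected = set(json.load(f))
    # Results of the full background run, e.g. {"e2e/login": "passed", ...}
    with open(full_report_path) as f:
        report = json.load(f)
    failed = {name for name, status in report.items() if status == "failed"}
    # Non-empty means the quick check would have let a real failure through.
    return failed - selected

if __name__ == "__main__":
    missed = missed_failures("ai_selected.json", "full_suite_report.json")
    if missed:
        print("AI quick check missed:", sorted(missed))
    else:
        print("AI selection covered every failing test in this run")
</code></pre>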
This is a hacky joke. No sane engineer would ever sign off on this. Even for a 1-5 person team, why would I want a probabilistic selection of test execution?<p>The solution to running only e2e tests on affected files has been around long before LLMs. This is a bandage on poor CI.
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you seem to suggest. I've worked with smart engineers who put a lot of time into this problem only to get middling results.
I think you might be confusing end-to-end (E2E) tests with other types of testing, such as unit and integration tests. No one is advocating this approach for unit tests, which should still run in their entirety on every pull request.<p>Running all E2E tests in a pipeline isn't feasible due to time constraints (it takes hours). Most companies just run these tests nightly (and we still do), which means we would still catch any issues that slip through the initial screening. But so far, nothing has.
> The solution to running only e2e tests on affected files has been around long before LLMs.<p>This doesn't work in distributed systems, since changing the behavior of one file that's compiled in one binary can cause a downstream issue in a separate binary that sends a network call to the first. e.g. A programmer makes a behavioral change to binary #1 that falls within defined behavior, but encounters Hyrum's Law because of a valid behavior of binary #2.
I would sign off on it. The only evidence I would need to see is some analysis of whether the risks are worth the benefits.<p>Risks: missing e2e tests that should have run letting bugs into production, more time spent chasing down flakes due to non determinism.<p>Benefits: increased productivity, catch bugs sooner (since you can run e2e tests more often).
Historically, this kind of test optimization was done either with static analysis to understand dependency graphs and/or with runtime data collected from executing the app.<p>However, those methods are tightly bound to programming languages, frameworks, and interpreters, so they are difficult to support across technology stacks.<p>This approach substitutes the intelligence of the LLM to make educated guesses about which tests to execute, to achieve the same goal of executing all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What's especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.<p>Has anyone seen LLMs being substituted for traditional analysis in other contexts to achieve language-agnostic results?
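For contrast, the traditional approach boils down to a reverse-reachability walk over a dependency graph, roughly like this toy sketch (the graph contents are invented):<p><pre><code>
from collections import deque

# module -> modules (and test suites) that depend on it, i.e. reverse edges.
REVERSE_DEPS = {
    "payments": ["checkout", "tests/e2e_checkout"],
    "checkout": ["tests/e2e_checkout", "tests/e2e_smoke"],
    "auth":     ["tests/e2e_login"],
}

def impacted_tests(changed_modules: set[str]) -> set[str]:
    # Walk "who depends on this?" transitively from the changed modules.
    seen, queue = set(changed_modules), deque(changed_modules)
    while queue:
        node = queue.popleft()
        for dependant in REVERSE_DEPS.get(node, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return {n for n in seen if n.startswith("tests/")}

print(impacted_tests({"payments"}))  # {'tests/e2e_checkout', 'tests/e2e_smoke'}
</code></pre>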
That’s a great point — the portability angle is definitely one of the most intriguing aspects here. Traditional static analysis tools are powerful but usually brittle outside of their home ecosystem, and the investment required to build/maintain them across multiple languages is huge. Using an LLM essentially offloads that specialization into the model’s “latent knowledge,” which could make it a lot more accessible for polyglot teams or systems with mixed stacks.<p>One interesting side effect I’ve noticed when experimenting with LLM-driven heuristics is that, while they lack the determinism of static/runtime analysis, they can sometimes capture semantic relationships that are invisible to traditional tooling. For example, a change in a configuration file or documentation might not show up in a dependency graph, but an LLM can still reason that it’s likely to impact certain classes of tests. That fuzziness can introduce false positives, but it also opens the door to catching categories of risk that would normally be missed.<p>I think the broader question is how comfortable teams are with probabilistic guarantees in their workflows. For some, the precision/recall tradeoff is acceptable if it means faster feedback and reduced CI bills. For others, the lack of hard guarantees makes it a non-starter. But I do see a pattern emerging where LLM-based analysis isn’t necessarily a replacement for traditional methods, but a complementary layer that can generalize across stacks and fill in gaps where traditional tools don’t reach.
I don’t think you truly get the “best” of both worlds, because the rate of accidentally omitting a broken test and letting something slip into master is now non-zero (flaky tests aside). This is still a tradeoff. But maybe it’s a good one!<p>I do wonder if this is as feasible at scale, where breaking master can be extremely costly (although at least it’s not running all tests for all commits, so a broken test won’t break all CI runs). Maybe it could be paired with, say, running all E2E tests post-merge and reporting breakages ASAP.
I think the point is that E2E takes too long, so it was only run nightly. Now it also runs selectively on PRs, in addition to nightly
Another idea would be to still run the rest of the E2E tests pre-merge, but as a separate job that only makes itself known if a failure occurs.
> The key phrase here is "think deep". This tells Claude Code not to be lazy with its analysis (while spending more thinking tokens). Without it, the output was very inconsistent. I used to joke that without it, Claude runs in “engineering manager mode” by delegating the work.<p>This section really stood out to me. I knew that asking GPT-5 to think gets better results, but I didn't know Claude had the same behavior. I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude?
In Anthropic's Claude Code Best Practices [1] (section 3.a.2), they note prompting with certain phrases corresponds to increasing levels of "thinking" for Claude Code:<p>"think" < "think hard" < "think harder" < "ultrathink"<p>[1] <a href="https://www.anthropic.com/engineering/claude-code-best-practices" rel="nofollow">https://www.anthropic.com/engineering/claude-code-best-pract...</a>
Since this is just a keyword, I'd expect you to prepend the entire prompt with a single line that says "think deep" or whichever think-level you want, rather than put "think deep" somewhere that makes sense in English within the prompt, as they do in the article.
To be clear I think this is an inference-level feature that looks for the string and performs inference at a higher thinking level, rather than a feature the model is trained on. Although I could be wrong, or it could be both.
Wow! Thanks for sharing - I didn't know this!
Man this is useful, I wish I knew earlier. I should really read the documentation. Or have Claude read it for me.
There is "think" and "super think" (there may be something in the middle).<p>If you use the word "think" at any point, this mode will also trigger...which is sometimes inconvenient.<p>Plan mode is probably better for these situations though. Claude seems to have got much worse recently, but even before, if it was stuck, asking it to "super think" never dislodged it (presumably because AI agents have no capacity for creativity, telling them to think more is just looping harder).<p>The loop of plan -> do is significantly more effective, as most AI agents that I have used extensively will get lost on trivial tasks if you just have the "do" phase without a clear plan (some, such as GPT, also appear unable to plan effectively...we have a contract with OpenAI at work, I have no idea how people use it).
> I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude<p>Any such % would be meaningless because both are non-deterministic black boxes with literally undefined behavior. Any % you'd be seeing could just be differences in available compute as the US is waking up
I am describing a simple eval. This is a strategy for determining how effective an AI system is; evals existed long before LLMs were ever a thing, and they are perfectly happy to deal with non-deterministic behavior. You are acting as if LLMs being non-deterministic means you can't say anything about them, but traditional ML systems were non-deterministic long before LLMs, and we had no problem building systems to probabilistically evaluate them.
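A minimal version of such an eval, assuming you have a handful of PRs hand-labeled with the tests that should run; select_tests is a stand-in for whatever calls the LLM, and repeating runs turns the non-determinism into a measurable average rather than a blocker:<p><pre><code>
def recall(selected: set[str], expected: set[str]) -> float:
    # Fraction of the hand-labeled tests that the selector actually picked.
    return 1.0 if not expected else len(selected & expected) / len(expected)

def evaluate(select_tests, labeled_prs: dict[str, set[str]], runs: int = 5) -> float:
    # Average recall over repeated runs per PR, so run-to-run variance is
    # measured instead of hand-waved away.
    scores = []
    for pr_id, expected in labeled_prs.items():
        for _ in range(runs):
            scores.append(recall(set(select_tests(pr_id)), expected))
    return sum(scores) / len(scores)
</code></pre>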
> Yes, and I'm not exaggerating. Claude never missed a relevant E2E test.<p>I was waiting to see this part demonstrated and validated - for a given PR, whether you created an expected set of tests that would run and then compared it to what actually ran. Without a baseline, as any tester would tell you, the LLM output is being trusted without checking.
I should have added that to the post, but we verify by running the full test suite in a cron job, so if a bug slips through, we will know. The full suite sometimes breaks, but those failures are due to changes outside the PR (e.g., a vendor going down).
That doesn't verify what you think it does, though. It doesn't verify that the LLM ran all the tests necessary to exercise the code paths that were changed. "Running all necessary tests" is indistinguishable from "we did a good job on these PRs and didn't write any new bugs that would have been caught by the tests". This is the classic situation where you can prove a negative (if all the LLM-selected tests pass, but there's a new failure in the full suite), but can't prove a positive (if all the LLM-selected tests pass, and the full suite also passes, it could just mean there were no new bugs, not that the LLM selected the right tests).<p>The only way to verify that is to do what GP suggested: for each PR, manually make a list of the tests that you believe should be run, and then compare that to the list of tests that the LLM actually runs. Obviously you aren't going to do this forever, but you should either do it for every single PR for some number of PRs, or pick a representative sample over time instead.
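As a sketch, that per-PR check is just a set comparison; the interesting output is the false negatives (the test names here are purely illustrative):<p><pre><code>
def review_selection(expected: set[str], selected: set[str]) -> None:
    # The dangerous direction: tests a human believed were needed
    # but the LLM did not run.
    missed = expected - selected
    # The merely wasteful direction: tests it ran that weren't needed.
    extra = selected - expected
    print("missed (false negatives):", sorted(missed))
    print("extra (false positives): ", sorted(extra))

# Hypothetical PR: a human reviewer expected these, the LLM picked those.
review_selection(
    expected={"e2e/checkout_flow", "e2e/payment", "e2e/smoke"},
    selected={"e2e/checkout_flow", "e2e/smoke", "e2e/login"},
)
</code></pre>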
Don't do this. If you have test time issues that are clogging the build pipeline, you can run a merge queue that creates ghost branches to test patches in sets rather than one by one.
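A toy sketch of that batching idea, in case it's unfamiliar (run_e2e and merge stand in for the real CI hooks, and the bisection strategy is just one option):<p><pre><code>
def process_batch(prs: list[str], run_e2e, merge) -> None:
    # run_e2e(prs) builds a throwaway branch with all the PRs applied and
    # runs the full suite once; merge(pr) lands a single PR on main.
    if not prs:
        return
    if run_e2e(prs):
        for pr in prs:
            merge(pr)
    elif len(prs) == 1:
        print(f"{prs[0]} broke the suite, bounce it out of the queue")
    else:
        # Split the failing batch and retest to find the offender(s).
        mid = len(prs) // 2
        process_batch(prs[:mid], run_e2e, merge)
        process_batch(prs[mid:], run_e2e, merge)
</code></pre>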
Agreed, this is just bad engineering. I hope we don’t see more ideas like this bleed into actual release processes.<p>Outside of a fun project to tinker on and see the results, I wouldn’t use this for anything that runs in production. It is better to use an LLM to assist you in building better static analysis tools, than it is to use the proposed technique in the article.
A few questions:<p>1. Are you seeing a lower number of production hot fixes?<p>2. Have you tried other models? Or thought about using output from different models and combining the results when they differ?<p>3. Other than time and cost, what software benchmarks are considered (i.e., fewer hot fixes, etc.)?<p>This is really cool, btw
Thanks!<p>> Are you seeing a lower number of production hot fixes?<p>Yes, with E2E tests in general: They are more effective at stopping incidents than other tests, but they require more effort to write. In my estimation, we prevent about 2-3 critical bugs per month from being merged into main (and consequently deployed).<p>For this project specifically: I think the critical bugs would have been caught in our overnight full E2E run anyway. The biggest gain was that E2E tests took too much time in the pipeline, and finding the root cause of bugs in nightly tests took even more time. When a test fails in the PR, we can quickly fix it before merging.<p>> Have you tried other models? Or thought about using output from different models and combining the results when they differ?<p>Not yet, but I think we need to start experimenting. Claude went offline for 30 minutes over the last 2 days, and engineers were blocked from merging because of it. I'm planning to add claude-code-router as a fallback.
But doesn't that defeat (part of) the purpose of the E2E test? i.e. you want to test _unrelated_ parts of the system that might have broken.
Sometimes. In my experience they bifurcate into:<p>- Smoke tests (does this page load at all?)
- Narrower tests (does this particular flow work?)<p>I think you're referring to smoke tests, but you likely always want to run those. It's the narrower tests that you are safe skipping.
I speak from a more distributed, microservices ecosystem, but the core issue is having a "fat" E2E testing suite in the first place. Since they take hours to run, it forces a "batched" testing approach where a bunch of PRs are tested together (nightly etc.) and leads to very slow feedback cycles for developers. Ideally I would shift to having a very small suite of E2E tests that can be run for every PR, and then have service-specific API tests that test the integration of the service being changed with the rest of the system. This way you de-centralize the testing as well and get closer to achieving true continuous delivery of each change to production.
This is called Test Impact Analysis and it is something worth making deterministic. Like with an algorithm, and without an LLM.
And people have already done this.
For example: SeaLights is a product that does this.
As other comments have said, I'd prefer other solutions that get all the tests to run faster. It would be interesting to see if it could be used to prioritise tests - running the tests most likely to fail sooner.
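For example, a cheap prioritisation pass could just sort the suite by recent failure rate so likely failures surface first and the pipeline fails fast (the history numbers below are invented):<p><pre><code>
# test -> (failures, runs) over the last N pipeline runs; numbers invented.
FAILURE_HISTORY = {
    "e2e/checkout": (7, 50),
    "e2e/login":    (1, 50),
    "e2e/search":   (0, 50),
}

def prioritised(tests: list[str]) -> list[str]:
    def risk(test: str) -> float:
        failures, runs = FAILURE_HISTORY.get(test, (0, 1))
        return failures / runs
    # Highest historical failure rate first.
    return sorted(tests, key=risk, reverse=True)

print(prioritised(["e2e/search", "e2e/login", "e2e/checkout"]))
# -> ['e2e/checkout', 'e2e/login', 'e2e/search']
</code></pre>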
I can appreciate the effort put into the goal of optimization shared in the post, even if I disagree with the conclusions. All of that effort would be much better directed at doing a manual (or LLM-assisted) audit of the E2E tests and choosing what to prune to reduce CI runtime.<p>DHH recently described[0] the approach they've taken at BaseCamp, reducing ~180 comprehensive-yet-brittle system tests down to 10 good-enough smoke tests, and it feels much more in spirit with where I would recommend folks invest effort: teams have way more tests than they need for an adequate level of confidence. Code and tests are a liability, and, to paraphrase Kent Beck[1], we should strive to write the minimal amount of tests and code to gain the maximal amount of confidence.<p>The other wrinkle here is that we're often paying through the nose in costs (complexity, actual dollars spent on CI services) by choosing to run all the tests all the time. It's a noble and worthy goal to figure out how not to do that, _but_, I think the conclusion shouldn't be to throw more $$$ into that money-pit, but rather just use all the power we have in our local dev workstations + trust to verify something is in a shippable state, another idea DHH covers[2] in the Rails World 2025 keynote; the whole thing is worth watching IMO.<p>[0] - <a href="https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752" rel="nofollow">https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752</a><p>[1] - <a href="https://stackoverflow.com/questions/153234/how-deep-are-your-unit-tests/153565#153565" rel="nofollow">https://stackoverflow.com/questions/153234/how-deep-are-your...</a><p>[2] - <a href="https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977" rel="nofollow">https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977</a>
Agreed. When you have multiple developers working on the same code, you end up with overlapping test coverage as time goes on. You also end up with test coverage that was initially written with good intentions, but ultimately you'll later find that some of it just isn't necessary for confidence, or isn't even testing what you think it is.<p>Teams need to periodically audit their tests, figure out what covers what, figure out what coverage is actually useful, and prune stuff that is duplicative and/or not useful.<p>OP says that ultimately their costs went down: even though using Claude to make these determinations is not cheap, they're saving more than they're paying Claude by running fewer tests (they run tests on a mobile device test farm, and I expect that can get pricey). But ultimately they might be able to save even more money by ditching Claude and deleting tests, or modifying tests to reduce their scope and runtime.<p>And at this point in the sophistication level of LLMs, I would feel safer about not having an LLM deciding which tests actually need to run to ensure a PR is safe to merge. I know OP says that so far they believe it's doing the right thing, but a) they mention their methodology for verifying this in a comment here[0], and I don't agree that it's a sound methodology[1], and b) LLMs are not deterministic and repeatable, so it could choose two very different sets of tests if run twice against the exact same PR. The risk of that happening may be acceptable, though; that's for each individual to decide.<p>[0] <a href="https://news.ycombinator.com/item?id=45152504">https://news.ycombinator.com/item?id=45152504</a><p>[1] <a href="https://news.ycombinator.com/item?id=45152668">https://news.ycombinator.com/item?id=45152668</a>
That's why we have smoke tests. Let your full suite run later.
> ...Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes.<p>So, it builds a dependency graph?<p>I've been playing with graph related things lately and it seems like there might be more efficient ways to do this than asking a daffy robot to do the job instead of a specific (AI crafted?) tool.<p>One could even get all fancy with said tool and use it to do fun and exciting things with the cross file dependencies like track down unused includes (or whatever) to improve build times.
> And I expect these costs will drop as models become cheaper.<p>Wait, models will get cheaper?
Costs are decreasing due to technological advancements, increased model efficiency, and the availability of smaller, more specialized models.<p>Large frontier models remain expensive, but costs for powerful yet cost-effective models have dropped significantly, and there’s no reason to believe this trend won’t continue.<p>Specific reasons for cost reduction:<p>- Smaller, More Efficient Models<p>- Improved Training and Hardware<p>- Mixture of Experts (MoE) Architecture<p>- Specialization<p>Evidence of cost declines include: price drops, cost-effective edge deployments, commoditization of core capabilities.
Next up: How I saved 53% of storage space by asking an LLM "will this be needed in the future?" for every write request
> We need coverage because missing a critical test could let bugs slip through to production.<p>Coverage for the sake of coverage doesn't solve much, right?<p>Like... you need setup, act, assertions for it to be worthwhile?
All E2E tests should run before deploy. Probably on every commit on develop branch, even. But there really is no need to run all E2E suite on every PR. In this case, the failure mode of this system, where PR automation failed to flag a breakage, is acceptable if it’s rare enough, so probabilistic solutions are OK.
We still run E2E tests before deployment, but running them on pull requests also eliminates the question: "We want to deploy, but have a bug. Which PR caused it, and how do we fix it?" This approach essentially saves you from having to perform a git bisect and keeps engineers from getting blocked by a bug unrelated to their task.
Yeah, this. You can commit optimistically, and worst case you can always revert a handful of commits, e.g. when the nightly e2e builds break, and manually root-cause the breaking commit(s).
>Overall, we're saving money, developer time, and preventing bugs that would make it to production. So it's a win-win-win!<p>"We" and "win" are both doing a lot of heavy lifting here, as they are whenever anyone talks about LLM labor destruction.
I mean, ok, maybe in a general sense, but do you really think the company is going to hire and employ someone just to sit and decide what E2E tests to run on every PR? No, of course not. What's going to happen instead is that either 1) all tests will continue to run, and productivity will continue to drop, or 2) PR authors will manually decide what tests to run, wasting their time (and they'll get it wrong sometimes, or have subtle unconscious biases that make them less likely to run the tests that take a longer amount of time, and/or the ones more likely to surface bugs in their PR).<p>So this is in theory a good thing: an LLM replacing a tedious task that no one was going to be hired to do anyway.<p>And besides, labor destruction could be a truly wonderful thing, if we had a functional, empathetic society in which we a) ensure that people have paths to retrain for new jobs while their basic needs are met by the state (or by the companies doing the labor destruction), and/or b) allow people whose jobs are destroyed to just not work at all, but provide them with enough resources to still have a decent life (UBI plus universal health care basically).<p>My utopia is one where there is no scarcity, and no one has to work to survive and have a good life. If I could snap my fingers and eliminate the need for every single job, and replace it with post-scarcity abundance, I would. People would build things and do science and help others because it gives them pleasure to do so, not because they have to in order to survive. And for people who just want to live lives of leisure, that would be fine, too. I don't think humanity will ever get to this state, mind you, but I can dream of a better world.
Flagged because:<p>> what if we could run only the relevant E2E tests<p>The real title should be "Using Claude Code to Reduce E2E Tests by 84%."