You didn't reduce e2e test time. You reduced e2e test coverage by only running the tests the LLM tells you to run.
Reminds me of that time when I was doing a POC on an existing code base and reduced build time by deleting all the unit tests
+1<p>At Meta we use similar heuristics on which tests to run per PR. When the system gets it wrong, which is often, it's painful and leads to merged code that broke tests that were never run
FWIW I feel like you could still do both, no? I.e. run all tests daily or at whatever cadence is not painful but still protective, but _also_ use heuristics to improve PR responsiveness.<p>Like anything, it's about tradeoffs. Though if it were me, I'd simply write a mechanism which deterministically decides which areas of code pertain to which tests and use that simple program to determine which tests run. The algorithm could be loosely as simple as things like code owners and git-blame, but relative to some big Code->Test map that you can have Claude/etc build ahead of time. The difference being it's deterministic between PRs and can be audited by humans for anything obviously poor or w/e.<p>As much as LLMs are interesting to me, I hate using them in places where I want consistency. CI seems terrible for them.. to me at least.
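Roughly what I have in mind, as a sketch; the mapping contents, paths, and test names are made up, and the Code->Test map itself is the thing you'd generate ahead of time and audit:<p><pre><code>
import fnmatch
import subprocess

# Assumed mapping; patterns and test names are hypothetical and would be
# generated once (by a human, Claude, etc.) and reviewed, not guessed in CI.
CODE_TO_TESTS = {
    "src/checkout/*": ["e2e/checkout_flow", "e2e/payment"],
    "src/auth/*":     ["e2e/login", "e2e/signup"],
    "config/*":       ["e2e/smoke"],
}

def changed_files(base: str = "origin/main") -> list[str]:
    # Files touched by the PR relative to the target branch.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def tests_to_run(files: list[str]) -> set[str]:
    selected: set[str] = set()
    for path in files:
        matched = False
        for pattern, tests in CODE_TO_TESTS.items():
            if fnmatch.fnmatch(path, pattern):
                selected.update(tests)
                matched = True
        if not matched:
            # Unmapped file: fail safe and fall back to the full suite.
            selected.add("e2e/full_suite")
    return selected

if __name__ == "__main__":
    print(sorted(tests_to_run(changed_files())))
</code></pre>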
> When the system gets it wrong, which is often<p>Which suggests there might be big room for improvement. If LLMs open up new ways to be more precise about which E2E to run, that is a significant benefit, because you’d manage risk but reduce the time and resources it takes.
Why do you respond to the title when that's addressed in the opening?
No, it's not. The key concept here is the reduction of <i>coverage</i>.<p>The opening part of the post makes certain claims -- "catch bugs before they ship" and "keep our master branch always clean" -- that can only be true if the coverage is not reduced. But, as other replies point out, if the heuristic for test selection has false negatives, the coverage is effectively reduced, which can result in bugs.<p>That problem can be bad enough when the heuristic is deterministic. AI throws determinism out of the window, which only exacerbates the problem.
Yeah I'm not seeing any evidence that this actually works. I would've liked to see some testing where they intentionally introduce a bug (ideally a tricky bug in a part of the code that isn't directly changed by the diff) and see if Claude catches it.<p>A good middle ground could be to allow the diff to land once the "AI quick check" passed, then keep the full test suite running in the background. If they run them side by side for a while and see that the AI quick check caught the failing test every time, I'd be convinced.
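The side-by-side check could be as simple as diffing the AI-selected list against the full-suite results, assuming both get dumped somewhere per PR; the file names and report format below are hypothetical:<p><pre><code>
import json

def missed_failures(selected_path: str, full_report_path: str) -> set[str]:
    # Tests the AI quick check chose to run, e.g. ["e2e/login", ...]
    with open(selected_path) as f:
        selected = set(json.load(f))
    # Results of the full background run, e.g. {"e2e/login": "passed", ...}
    with open(full_report_path) as f:
        report = json.load(f)
    failed = {name for name, status in report.items() if status == "failed"}
    # Non-empty means the quick check would have let a real failure through.
    return failed - selected

if __name__ == "__main__":
    missed = missed_failures("ai_selected.json", "full_suite_report.json")
    if missed:
        print("AI quick check missed:", sorted(missed))
    else:
        print("AI selection covered every failing test in this run")
</code></pre>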
This is a hacky joke. No sane engineer would ever sign off on this. Even for a 1-5 person team, why would I want a probabilistic selection of test execution?<p>The solution to running only e2e tests on affected files has been around long before LLMs. This is a bandage on poor CI.
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you seem to suggest. I've worked with smart engineers who put a lot of time into this problem only to get middling results.
I think you might be confusing end-to-end (E2E) tests with other types of testing, such as unit and integration tests. No one is advocating this approach for unit tests, which should still run in their entirety on every pull request.<p>Running all E2E tests in a pipeline isn't feasible due to time constraints (it takes hours). Most companies just run these tests nightly (and we still do), which means we would still catch any issues that slip through the initial screening. But so far, nothing has.
> The solution to running only e2e tests on affected files has been around long before LLMs.<p>This doesn't work in distributed systems, since changing the behavior of one file that's compiled in one binary can cause a downstream issue in a separate binary that sends a network call to the first. e.g. A programmer makes a behavioral change to binary #1 that falls within defined behavior, but encounters Hyrum's Law because of a valid behavior of binary #2.
I would sign off on it. The only evidence I would need to see is some analysis of whether the risks are worth the benefits.<p>Risks: missing e2e tests that should have run letting bugs into production, more time spent chasing down flakes due to non determinism.<p>Benefits: increased productivity, catch bugs sooner (since you can run e2e tests more often).
Historically, this kind of test optimization was done either with static analysis to understand dependency graphs and/or with runtime data collected from executing the app.<p>However, those methods are tightly bound to programming languages, frameworks, and interpreters, so they are difficult to support across technology stacks.<p>This approach substitutes the intelligence of the LLM to make educated guesses about which tests to execute, to achieve the same goal of executing all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What's especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.<p>Has anyone seen LLMs being substituted for traditional analysis in other contexts to achieve language-agnostic results?
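For contrast, the traditional approach boils down to a reverse-reachability walk over a dependency graph, roughly like this toy sketch (the graph contents are invented):<p><pre><code>
from collections import deque

# module -> modules (and test suites) that depend on it, i.e. reverse edges.
REVERSE_DEPS = {
    "payments": ["checkout", "tests/e2e_checkout"],
    "checkout": ["tests/e2e_checkout", "tests/e2e_smoke"],
    "auth":     ["tests/e2e_login"],
}

def impacted_tests(changed_modules: set[str]) -> set[str]:
    # Walk "who depends on this?" transitively from the changed modules.
    seen, queue = set(changed_modules), deque(changed_modules)
    while queue:
        node = queue.popleft()
        for dependant in REVERSE_DEPS.get(node, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return {n for n in seen if n.startswith("tests/")}

print(impacted_tests({"payments"}))  # {'tests/e2e_checkout', 'tests/e2e_smoke'}
</code></pre>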
That’s a great point — the portability angle is definitely one of the most intriguing aspects here. Traditional static analysis tools are powerful but usually brittle outside of their home ecosystem, and the investment required to build/maintain them across multiple languages is huge. Using an LLM essentially offloads that specialization into the model’s “latent knowledge,” which could make it a lot more accessible for polyglot teams or systems with mixed stacks.<p>One interesting side effect I’ve noticed when experimenting with LLM-driven heuristics is that, while they lack the determinism of static/runtime analysis, they can sometimes capture semantic relationships that are invisible to traditional tooling. For example, a change in a configuration file or documentation might not show up in a dependency graph, but an LLM can still reason that it’s likely to impact certain classes of tests. That fuzziness can introduce false positives, but it also opens the door to catching categories of risk that would normally be missed.<p>I think the broader question is how comfortable teams are with probabilistic guarantees in their workflows. For some, the precision/recall tradeoff is acceptable if it means faster feedback and reduced CI bills. For others, the lack of hard guarantees makes it a non-starter. But I do see a pattern emerging where LLM-based analysis isn’t necessarily a replacement for traditional methods, but a complementary layer that can generalize across stacks and fill in gaps where traditional tools don’t reach.
I don’t think you truly get the “best” of both worlds, because the rate of accidentally omitting a broken test and letting something slip into master is now non-zero (flaky tests aside). This is still a tradeoff. But maybe it’s a good one!<p>I do wonder if this is as feasible at scale, where breaking master can be extremely costly (although at least it’s not running all tests for all commits, so a broken test won’t break all CI runs). Maybe it could be paired with, say, running all E2E tests post-merge and reporting breakages ASAP.
I think the point is that E2E takes too long, so it was only run nightly. Now it also runs selectively on PRs, in addition to nightly
Another idea would be to still run the rest of the E2E tests pre-merge, but as a separate job that only makes itself known if a failure occurs.
> The key phrase here is "think deep". This tells Claude Code not to be lazy with its analysis (while spending more thinking tokens). Without it, the output was very inconsistent. I used to joke that without it, Claude runs in “engineering manager mode” by delegating the work.<p>This section really stood out to me. I knew that asking GPT-5 to think gets better results, but I didn't know Claude had the same behavior. I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude?
In Anthropic's Claude Code Best Practices [1] (section 3.a.2), they note prompting with certain phrases corresponds to increasing levels of "thinking" for Claude Code:<p>"think" < "think hard" < "think harder" < "ultrathink"<p>[1] <a href="https://www.anthropic.com/engineering/claude-code-best-practices" rel="nofollow">https://www.anthropic.com/engineering/claude-code-best-pract...</a>
Since this is just a keyword, I'd expect you to prepend the entire prompt with a single line that says "think deep" or whichever think-level you want, rather than put "think deep" somewhere that makes sense in English within the prompt, as they do in the article.
To be clear I think this is an inference-level feature that looks for the string and performs inference at a higher thinking level, rather than a feature the model is trained on. Although I could be wrong, or it could be both.
Wow! Thanks for sharing - I didn't know this!
Man this is useful, I wish I knew earlier. I should really read the documentation. Or have Claude read it for me.
There is "think" and "super think" (there may be something in the middle).<p>If you use the word "think" at any point, this mode will also trigger...which is sometimes inconvenient.<p>Plan mode is probably better for these situations though. Claude seems to have got much worse recently, but even before, if it was stuck, asking it to "super think" never dislodged it (presumably because AI agents have no capacity for creativity, telling them to think more is just looping harder).<p>The loop of plan -> do is significantly more effective, as most AI agents that I have used extensively will get lost on trivial tasks if you just have the "do" phase without a clear plan (some, such as GPT, also appear unable to plan effectively...we have a contract with OpenAI at work, I have no idea how people use it).
> I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude<p>Any such % would be meaningless because both are non-deterministic black boxes with literally undefined behavior. Any % you'd be seeing could just be differences in available compute as the US is waking up
I am describing a simple eval. This is a strategy for determining how effective an AI system is; evals existed long before LLMs were ever a thing, and they are perfectly happy to deal with non-deterministic behavior. You are acting as if LLMs being non-deterministic means you can't say anything about them, but traditional ML systems were non-deterministic long before LLMs, and we had no problem building systems to probabilistically evaluate them.
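A minimal version of such an eval, assuming you have a handful of PRs hand-labeled with the tests that should run; select_tests is a stand-in for whatever calls the LLM, and repeating runs turns the non-determinism into a measurable average rather than a blocker:<p><pre><code>
def recall(selected: set[str], expected: set[str]) -> float:
    # Fraction of the hand-labeled tests that the selector actually picked.
    return 1.0 if not expected else len(selected & expected) / len(expected)

def evaluate(select_tests, labeled_prs: dict[str, set[str]], runs: int = 5) -> float:
    # Average recall over repeated runs per PR, so run-to-run variance is
    # measured instead of hand-waved away.
    scores = []
    for pr_id, expected in labeled_prs.items():
        for _ in range(runs):
            scores.append(recall(set(select_tests(pr_id)), expected))
    return sum(scores) / len(scores)
</code></pre>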
> Yes, and I'm not exaggerating. Claude never missed a relevant E2E test.<p>I was waiting to see this part demonstrated and validated - for a given PR, whether you created an expected set of tests that would run and then compared it to what actually ran. Without a baseline, as any tester would tell you, the LLM output is being trusted without checking.
I should have added that to the post, but we verify by running the full test suite in a cron job, so if a bug slips through, we will know. The full suite sometimes breaks, but those failures are due to changes outside the PR (e.g., a vendor going down).
That doesn't verify what you think it does, though. It doesn't verify that the LLM ran all the tests necessary to exercise the code paths that were changed. "Running all necessary tests" is indistinguishable from "we did a good job on these PRs and didn't write any new bugs that would have been caught by the tests". This is the classic situation where you can prove a negative (if all the LLM-selected tests pass, but there's a new failure in the full suite), but can't prove a positive (if all the LLM-selected tests pass, and the full suite also passes, it could just mean there were no new bugs, not that the LLM selected the right tests).<p>The only way to verify that is to do what GP suggested: for each PR, manually make a list of the tests that you believe should be run, and then compare that to the list of tests that the LLM actually runs. Obviously you aren't going to do this forever, but you should either do it for every single PR for some number of PRs, or pick a representative sample over time instead.
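As a sketch, that per-PR check is just a set comparison; the interesting output is the false negatives (the test names here are purely illustrative):<p><pre><code>
def review_selection(expected: set[str], selected: set[str]) -> None:
    # The dangerous direction: tests a human believed were needed
    # but the LLM did not run.
    missed = expected - selected
    # The merely wasteful direction: tests it ran that weren't needed.
    extra = selected - expected
    print("missed (false negatives):", sorted(missed))
    print("extra (false positives): ", sorted(extra))

# Hypothetical PR: a human reviewer expected these, the LLM picked those.
review_selection(
    expected={"e2e/checkout_flow", "e2e/payment", "e2e/smoke"},
    selected={"e2e/checkout_flow", "e2e/smoke", "e2e/login"},
)
</code></pre>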
Don't do this. If you have test time issues that are clogging the build pipeline, you can run a merge queue that creates ghost branches to test patches in sets rather than one by one.
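A toy sketch of that batching idea, in case it's unfamiliar (run_e2e and merge stand in for the real CI hooks, and the bisection strategy is just one option):<p><pre><code>
def process_batch(prs: list[str], run_e2e, merge) -> None:
    # run_e2e(prs) builds a throwaway branch with all the PRs applied and
    # runs the full suite once; merge(pr) lands a single PR on main.
    if not prs:
        return
    if run_e2e(prs):
        for pr in prs:
            merge(pr)
    elif len(prs) == 1:
        print(f"{prs[0]} broke the suite, bounce it out of the queue")
    else:
        # Split the failing batch and retest to find the offender(s).
        mid = len(prs) // 2
        process_batch(prs[:mid], run_e2e, merge)
        process_batch(prs[mid:], run_e2e, merge)
</code></pre>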
Agreed, this is just bad engineering. I hope we don’t see more ideas like this bleed into actual release processes.<p>Outside of a fun project to tinker on and see the results, I wouldn’t use this for anything that runs in production. It is better to use an LLM to assist you in building better static analysis tools, than it is to use the proposed technique in the article.
A few questions:<p>1. Are you seeing a lower number of production hot fixes?<p>2. Have you tried other models? Or thought about using output from different models and combining the results when they differ?<p>3. Other than time and cost, what software benchmarks are considered (i.e., fewer hot fixes, etc.)?<p>This is really cool, btw
Thanks!<p>> Are you seeing a lower number of production hot fixes?<p>Yes, with E2E tests in general: They are more effective at stopping incidents than other tests, but they require more effort to write. In my estimation, we prevent about 2-3 critical bugs per month from being merged into main (and consequently deployed).<p>For this project specifically: I think the critical bugs would have been caught in our overnight full E2E run anyway. The biggest gain was that E2E tests took too much time in the pipeline, and finding the root cause of bugs in nightly tests took even more time. When a test fails in the PR, we can quickly fix it before merging.<p>> Have you tried other models? Or thought about using output from different models and combining the results when they differ?<p>Not yet, but I think we need to start experimenting. Claude went offline for 30 minutes over the last 2 days, and engineers were blocked from merging because of it. I'm planning to add claude-code-router as a fallback.
But doesn't that defeat (part of) the purpose of the E2E test? i.e. you want to test _unrelated_ parts of the system that might have broken.
Sometimes. In my experience they bifurcate into:<p>- Smoke tests (does this page load at all?)
- Narrower tests (does this particular flow work?)<p>I think you're referring to smoke tests, but you likely always want to run those. It's the narrower tests that you are safe skipping.
I speak from a more distributed, microservices ecosystem, but the core issue is having a "fat" E2E testing suite in the first place. Since they take hours to run, it forces a "batched" testing approach where a bunch of PRs are tested together (nightly etc.) and leads to very slow feedback cycles for developers. Ideally I would shift to having a very small suite of E2E tests that can be run for every PR, and then have service-specific API tests that test the integration of the service being changed with the rest of the system. This way you de-centralize the testing as well and get closer to achieving true continuous delivery of each change to production.
This is called Test Impact Analysis and it is something worth making deterministic. Like with an algorithm, and without an LLM.
And people have already done this.
For example: SeaLights is a product that does this.
As other comments have said, I'd prefer other solutions that get all the tests to run faster. It would be interesting to see if it could be used to prioritise tests - running the tests most likely to fail sooner.
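For example, a cheap prioritisation pass could just sort the suite by recent failure rate so likely failures surface first and the pipeline fails fast (the history numbers below are invented):<p><pre><code>
# test -> (failures, runs) over the last N pipeline runs; numbers invented.
FAILURE_HISTORY = {
    "e2e/checkout": (7, 50),
    "e2e/login":    (1, 50),
    "e2e/search":   (0, 50),
}

def prioritised(tests: list[str]) -> list[str]:
    def risk(test: str) -> float:
        failures, runs = FAILURE_HISTORY.get(test, (0, 1))
        return failures / runs
    # Highest historical failure rate first.
    return sorted(tests, key=risk, reverse=True)

print(prioritised(["e2e/search", "e2e/login", "e2e/checkout"]))
# -> ['e2e/checkout', 'e2e/login', 'e2e/search']
</code></pre>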
I can appreciate the effort put into the goal of optimization shared in the post, even if I disagree with the conclusions. All of that effort would be much better directed at doing a manual (or LLM-assisted) audit of the E2E tests and choosing what to prune to reduce CI runtime.<p>DHH recently described[0] the approach they've taken at BaseCamp, reducing ~180 comprehensive-yet-brittle system tests down to 10 good-enough smoke tests, and it feels much more in spirit with where I would recommend folks invest effort: teams have way more tests than they need for an adequate level of confidence. Code and tests are a liability, and, to paraphrase Kent Beck[1], we should strive to write the minimal amount of tests and code to gain the maximal amount of confidence.<p>The other wrinkle here is that we're often paying through the nose in costs (complexity, actual dollars spent on CI services) by choosing to run all the tests all the time. It's a noble and worthy goal to figure out how not to do that, _but_, I think the conclusion shouldn't be to throw more $$$ into that money-pit, but rather just use all the power we have in our local dev workstations + trust to verify something is in a shippable state, another idea DHH covers[2] in the Rails World 2025 keynote; the whole thing is worth watching IMO.<p>[0] - <a href="https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752" rel="nofollow">https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752</a><p>[1] - <a href="https://stackoverflow.com/questions/153234/how-deep-are-your-unit-tests/153565#153565" rel="nofollow">https://stackoverflow.com/questions/153234/how-deep-are-your...</a><p>[2] - <a href="https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977" rel="nofollow">https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977</a>
Agreed. When you have multiple developers working on the same code, you end up with overlapping test coverage as time goes on. You also end up with test coverage that was initially written with good intentions, but ultimately you'll later find that some of it just isn't necessary for confidence, or isn't even testing what you think it is.<p>Teams need to periodically audit their tests, figure out what covers what, figure out what coverage is actually useful, and prune stuff that is duplicative and/or not useful.<p>OP says that ultimately their costs went down: even though using Claude to make these determinations is not cheap, they're saving more than they're paying Claude by running fewer tests (they run tests on a mobile device test farm, and I expect that can get pricey). But ultimately they might be able to save even more money by ditching Claude and deleting tests, or modifying tests to reduce their scope and runtime.<p>And at this point in the sophistication level of LLMs, I would feel safer about not having an LLM deciding which tests actually need to run to ensure a PR is safe to merge. I know OP says that so far they believe it's doing the right thing, but a) they mention their methodology for verifying this in a comment here[0], and I don't agree that it's a sound methodology[1], and b) LLMs are not deterministic and repeatable, so it could choose two very different sets of tests if run twice against the exact same PR. The risk of that happening may be acceptable, though; that's for each individual to decide.<p>[0] <a href="https://news.ycombinator.com/item?id=45152504">https://news.ycombinator.com/item?id=45152504</a><p>[1] <a href="https://news.ycombinator.com/item?id=45152668">https://news.ycombinator.com/item?id=45152668</a>
That's why we have smoke tests. Let your full suite run later.
> ...Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes.<p>So, it builds a dependency graph?<p>I've been playing with graph related things lately and it seems like there might be more efficient ways to do this than asking a daffy robot to do the job instead of a specific (AI crafted?) tool.<p>One could even get all fancy with said tool and use it to do fun and exciting things with the cross file dependencies like track down unused includes (or whatever) to improve build times.
> And I expect these costs will drop as models become cheaper.<p>Wait, models will get cheaper?
Costs are decreasing due to technological advancements, increased model efficiency, and the availability of smaller, more specialized models.<p>Large frontier models remain expensive, but costs for powerful yet cost-effective models have dropped significantly, and there’s no reason to believe this trend won’t continue.<p>Specific reasons for cost reduction:<p>- Smaller, More Efficient Models<p>- Improved Training and Hardware<p>- Mixture of Experts (MoE) Architecture<p>- Specialization<p>Evidence of cost declines include: price drops, cost-effective edge deployments, commoditization of core capabilities.
Next up: How I saved 53% of storage space by asking an LLM "will this be needed in the future?" for every write request
> We need coverage because missing a critical test could let bugs slip through to production.<p>Coverage for the sake of coverage doesn't solve much, right?<p>Like... you need setup, act, assertions for it to be worthwhile?
All E2E tests should run before deploy. Probably on every commit on develop branch, even. But there really is no need to run all E2E suite on every PR. In this case, the failure mode of this system, where PR automation failed to flag a breakage, is acceptable if it’s rare enough, so probabilistic solutions are OK.
We still run E2E tests before deployment, but running them on pull requests also eliminates the question: "We want to deploy, but have a bug. Which PR caused it, and how do we fix it?" This approach essentially saves you from having to perform a git bisect and keeps engineers from getting blocked by a bug unrelated to their task.
Yeah, this. You can commit optimistically, and worst case you can always revert a handful of commits, e.g. when the nightly e2e builds break, and manually root-cause the breaking commit(s).
>Overall, we're saving money, developer time, and preventing bugs that would make it to production. So it's a win-win-win!<p>"We" and "win" are both doing a lot of heavy lifting here, as they are whenever anyone talks about LLM labor destruction.
I mean, ok, maybe in a general sense, but do you really think the company is going to hire and employ someone just to sit and decide what E2E tests to run on every PR? No, of course not. What's going to happen instead is that either 1) all tests will continue to run, and productivity will continue to drop, or 2) PR authors will manually decide what tests to run, wasting their time (and they'll get it wrong sometimes, or have subtle unconscious biases that make them less likely to run the tests that take a longer amount of time, and/or the ones more likely to surface bugs in their PR).<p>So this is in theory a good thing: an LLM replacing a tedious task that no one was going to be hired to do anyway.<p>And besides, labor destruction could be a truly wonderful thing, if we had a functional, empathetic society in which we a) ensure that people have paths to retrain for new jobs while their basic needs are met by the state (or by the companies doing the labor destruction), and/or b) allow people whose jobs are destroyed to just not work at all, but provide them with enough resources to still have a decent life (UBI plus universal health care basically).<p>My utopia is one where there is no scarcity, and no one has to work to survive and have a good life. If I could snap my fingers and eliminate the need for every single job, and replace it with post-scarcity abundance, I would. People would build things and do science and help others because it gives them pleasure to do so, not because they have to in order to survive. And for people who just want to live lives of leisure, that would be fine, too. I don't think humanity will ever get to this state, mind you, but I can dream of a better world.
Flagged because:<p>> what if we could run only the relevant E2E tests<p>The real title should be "Using Claude Code to Reduce E2E Tests by 84%."