Launch HN: TesterArmy (YC P26) – Agents that test web and mobile apps

(tester.army)

88 points by okwasniewski 7 hours ago

21 comments

poisonborz 5 hours ago
E2E tests are now quick to write thanks to LLMs, and are then deterministic AND cheap to run. How would this compare to the token costs of running an agent the whole time for each test? How do you make sure results stay stable despite the nondeterministic nature of agents? Do customers still need to create test cases? Is there any way to import them from a test case management system, from which they could have already generated e2e tests locally?
- okwasniewski 5 hours ago
 Unfortunately, in our experience, tests don’t scale as well as code. First of all, static tests are very brittle: you rely on selectors, need wait times, and can’t really test a lot of dynamic content (think AI chats/interactions). Then there’s all the infrastructure around it: solving captchas, handling auth, handling email OTP (each of our agents has access to its own inbox), and handling video recording and screenshots.
 To ensure stable results we do a lot of harness engineering: we inject trajectories of previous tests to keep runs stable, and splitting tests into smaller steps helps prevent context overload and decision fatigue.
 Regarding test case management, our customers have used our CLI to migrate their existing test cases from whatever system they were using before.
 - ai_slop_hater 4 hours ago
 Why can't you test AI chats?
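For readers following the brittleness point in okwasniewski's reply above, here is a minimal Python sketch of why a selector-based lookup can break on a cosmetic change while a semantic lookup survives. It is not from the thread and not TesterArmy's implementation; `Element`, `find_by_class`, and `find_by_role` are hypothetical names on a toy page model, not any real framework's API.

```python
# Toy page model for illustration only (hypothetical, not a real testing API).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    css_class: str  # implementation detail; changes with styling refactors
    role: str       # accessibility role, e.g. "button"
    name: str       # accessible name the user actually sees


def find_by_class(page: List[Element], css_class: str) -> Optional[Element]:
    """Selector-style lookup: coupled to markup details."""
    return next((e for e in page if e.css_class == css_class), None)


def find_by_role(page: List[Element], role: str, name: str) -> Optional[Element]:
    """Semantic lookup: coupled to what the user sees."""
    return next((e for e in page if e.role == role and e.name == name), None)


# The same "Submit" button before and after a purely cosmetic restyle.
before = [Element("btn-primary", "button", "Submit")]
after = [Element("cta-v2", "button", "Submit")]

print(find_by_class(before, "btn-primary") is not None)   # found before the restyle
print(find_by_class(after, "btn-primary") is not None)    # missing after the restyle
print(find_by_role(after, "button", "Submit") is not None)  # still found
```

The sketch only covers the selector half of the claim; wait times and dynamic content are separate problems it does not model.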
dbbk 7 hours ago
"Traditional E2E tests are slow to set up and expensive to maintain." I don't really understand this. If I'm already using Opus to write the code, surely it would know best what E2E tests to write to be able to verify its own output? This seems like an unnecessary external step.
- okwasniewski 7 hours ago
 Unfortunately, in our experience, tests don’t scale as well as code. First of all, static tests are very brittle: you rely on selectors, need wait times, and can’t really test a lot of dynamic content (think AI chats/interactions). Then there’s all the infrastructure around it: solving captchas, handling auth, handling email OTP (each of our agents has access to its own inbox), and handling video recording and screenshots. So with the traditional testing approach you end up mocking a lot of services. I highly recommend giving it a try!
 - Obertr 27 minutes ago
 I would respectfully disagree on this. The way I write tests right now: I ask Claude/Codex to create an eval, and it spins up a background LLM agent worker that verifies the tests in the sandbox/internally.
 So I would say that, at the moment, in-house testing is easier than external testing for us.
RayFitzgerald 5 hours ago
Love your approach to product. It feels like TesterArmy will become the "Vercel for testing". Refreshing stuff!
- okwasniewski 5 hours ago
  Thank you! That's the goal
pensono 2 hours ago
Love using TesterArmy to validate PRs against my preview environment. It lets me skip the manual check much of the time and helps me ship more confidently.
- okwasniewski 1 hour ago
  Happy to hear that!
msencenb 7 hours ago
Have you been able to nail down a loop where your tool can take an open PR, guess the code path, and do some testing?
We use Cypress heavily for our core flows, which has a similar AI prompt thing, but it’s not quite ad hoc enough for smaller fixes, which is where the bottleneck still comes in for us.
- okwasniewski 6 hours ago
 Yes! We spent quite a lot of time on this, and we are currently creating a test plan based on PR changes and sending an agent to verify it. We have some customers who are only using this feature.
_pdp_ 4 hours ago
Great presentation.
On a slight tangent, since we are all here... Does anyone still believe there is a long-term future in traditional UI/UX?
It feels like a lot of attention is still going into landing pages, dashboards, and CRUD apps, while overlooking a bigger shift: fewer people will actually need to interact with those interfaces directly when the same tools can perform the underlying tasks automatically, without much UI at all.
So the bigger question is: does UI/UX evolve into something else, or does a large part of it simply disappear?
I might be a bit too early. Recently I started a project and decided to skip all of that and focus on making it friendlier to AI agents, and frankly so far it has been great, both for the user experience and for what it delivers.
- altmanaltman 1 hour ago
 How can it perform tasks automatically? It's not magic; there has to be a UI/UX for interacting with it. The question is whether that UI/UX will be more optimized and easier to use. Like, would you prefer saying "close window, computer", pressing alt+f4, or just clicking on the little cross thing or equivalent? Why are we assuming all AI automagic UI/UX will be better for all tasks?
 - _pdp_ 32 minutes ago
 AI agents can do a lot of the data entry tasks and build dashboards perfectly well. You need to build practically none of that when you can ask an AI agent to pull the data and build a chart, or provide a file or a paste to insert into a database.
 Basically that.
 If the app requires a mouse then it should have a UI; if not, unless critical, it can be driven by an agent.
 That's my point.
antifarben 4 hours ago
What are people using to test mobile apps on self-hosted infrastructure nowadays? Is there a solution that's not super heavy and/or slow?
peterspath 3 hours ago
Does it support testing on all Apple platforms (macOS, iOS, iPadOS, watchOS, tvOS, and visionOS)?
- okwasniewski 1 hour ago
  Currently, only iOS, but we can add iPadOS too!
tcoff91 5 hours ago
I'm curious how your mobile testing compares to https://revyl.com
I've been experimenting with Revyl and it's really nice. I think this agent-driven testing is the future.
- okwasniewski 5 hours ago
 We support both web and mobile, which is what a lot of companies prefer: just one agent for both. Also, I'm pretty sure Revyl relies only on vision models, which tend to be slower. We built the platform around a hybrid approach that combines vision and accessibility APIs, which is much faster.
 Would love to hear your feedback after you try it out!
Laurel1234 6 hours ago
Seems interesting, but I wonder about this:
> Traditional E2E tests are slow to set up and expensive to maintain.
Isn't this just using agents to create e2e tests, or is there some better new approach I'm missing?
- okwasniewski 6 hours ago
 We use agents to navigate the app, making real-time decisions based on its state. I prefer to compare it more to a manual QA engineer than to static e2e tests. We spent a lot of time on the harness to make sure the results are reliable. This allows you to assert on dynamic content like AI-generated content. We also support validation of email flows since the agent can read its own email.
 - jaggederest 5 hours ago
 Fable (rip) is absurdly good at this. Great time to build a product around it. You definitely need the harness, but it feels like it just turned the corner to being able to do really in-depth and edge-case work.
 Do you handle heterogeneous environments and network connectivity simulation as well? I am working on a mobile app, and occasionally having users just lose a request or two can put the state machine into unusual modes.
 - okwasniewski 5 hours ago
 I feel like new AI model releases will only allow our agents to do more in-depth testing; the space still has a lot of room to grow. Quality assurance is way more complicated than just clicking around a UI.
 Regarding the other question: not yet. For now, we have Chromium, iOS, and Android (latest versions of each), but we are working on adding more. Regarding network connectivity, it's coming soon (I have an open PR).
yohguy 7 hours ago
Does it work for native mobile applications or Expo apps that have native modules?
Pricing question: the usage on the plans seems low, considering that in the demo you said you have 25 tests per PR, which would mean you get only 10 PRs per month on the hobby plan?
- okwasniewski 7 hours ago
 Yes, it works for any framework. We just take the built native binary and run it in the cloud.
 Regarding pricing, the self-serve options are currently only for lower usage. We will add more plans further down the line. Currently the most popular one is the startup plan. If you need more usage I’m happy to discuss it on a call!
Lionga 3 hours ago
The most flaky tests possible as a service. Everyone knows that no tests are better than unreliable tests.
j0sip 4 hours ago
I wonder how it compares to mobileboost.io, which has been used by some companies like Duolingo?
- okwasniewski 4 hours ago
  Our approach is heavily focused on agents, both for executing tests and for managing the platform. We want to provide the best and simplest way to conduct agentic testing, with a strong focus on details. It looks like their platform also requires a sales call.
rpunkfu 6 hours ago
Congratulations on the launch, I’ve been tracking your progress since you were accepted into the spring batch.
Always happy to see cool products from Poland! :)
- okwasniewski 6 hours ago
 Thank you!
iknownthing 6 hours ago
.army?
- okwasniewski 6 hours ago
  We are thinking about whether to change this. We also have testerarmy.com/.ai
  - thih9 5 hours ago
    Change it now to .com or get stuck there for years, suffering anti-spam filters, potential renewal problems, and more in the meantime.
  - tootubular 4 hours ago
    You have the .com? That's a no-brainer imo. I have a domain for a saas where the .com is squatted so I settled for .ai (and other surrounding TLDs / host permutations) and right out of the gate ran into some issues with firewall vendors in corpo environments.
    - okwasniewski 4 hours ago
      Yeah, we have all of them. I saw it too: at bigger companies our emails were going straight to spam. Will migrate to the .com soon.
zuzululu 4 hours ago
Not sure the pain point you mentioned resonates. With LLMs it's very easy to do E2E testing. Also, I feel uneasy about outsourcing this part with all the security issues these days.
- okwasniewski 4 hours ago
 Unfortunately, in our experience, tests don’t scale as well as code.
 First of all, static tests are very brittle: you rely on selectors, need wait times, and can’t really test a lot of dynamic content (think AI chats/interactions). Then there’s all the infrastructure around it: solving captchas, handling auth, handling email OTP (each of our agents has access to its own inbox), spinning up simulators, and handling video recording and screenshots.
 To ensure stable results we do a lot of harness engineering: we inject trajectories of previous tests to keep runs stable, and splitting tests into smaller steps helps prevent context overload and decision fatigue.
 Regarding the security part, the product can operate without any access to the codebase; you can just give us a URL or a mobile app build and we will do the testing.
 - skinfaxi 4 hours ago
 Goodness I really didn't expect such lazy copy-pasting of responses for a YC company.