12 comments

  • haunter 3 minutes ago
    Is there an AI which can "solve" the Path of Exile 1/2 passive skill tree yet?
  • justindz 2 hours ago
    What a great lunch read! I've been weekend-warrioring a terminal-based CRPG for a bit myself. I was recently exploring ways to use agents to help with balance testing, which is a real scale problem for a solo indie dev. So far, all I've created is a fight simulator: essentially, have the current player state (stats, effects, gear, companions, etc.) run a given fight, simulated, X number of times using one of the currently implemented GOAP personalities, and report how often it wins, how often it loses, the average end turn, and so on.

    I hadn't really thought about trying to create a harness for agents to play the full game interactively. I'd love to explore this. If you don't mind, here are a few questions:

    1) Is it correct to assume that I probably need a text-only harness, even though my game is already text-based, because I make use of menu selections via arrow-key-and-enter interactions?

    2) Do you have prompt recommendations for the type of feedback you have found useful? I would guess that in your case the objectives of the game are clearer than in an open-world RPG. What dead ends have you run into? Maybe a variety of approaches would be good? One agent tries to fight everything; another focuses on gaining and completing as many quests as possible.

    3) How bad is the token burn doing this? Any optimization strategies you've employed?
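The fight-simulator loop described above (run a fight X times, report win rate and average end turn) can be sketched as a small Monte Carlo harness. This is a hypothetical illustration, not the commenter's code; the `Fighter` stats and the simple alternating-attack combat model are invented for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Fighter:
    hp: int
    min_dmg: int
    max_dmg: int

def simulate_fight(player: Fighter, enemy: Fighter, rng: random.Random) -> tuple[bool, int]:
    """Run one fight to completion; return (player_won, end_turn).

    Each turn the player strikes first, then the enemy, until one side drops."""
    p_hp, e_hp = player.hp, enemy.hp
    turn = 0
    while p_hp > 0 and e_hp > 0:
        turn += 1
        e_hp -= rng.randint(player.min_dmg, player.max_dmg)
        if e_hp <= 0:
            break
        p_hp -= rng.randint(enemy.min_dmg, enemy.max_dmg)
    return p_hp > 0, turn

def run_balance_report(player: Fighter, enemy: Fighter, trials: int = 1000, seed: int = 0) -> dict:
    """Repeat the fight `trials` times and aggregate the stats the comment mentions."""
    rng = random.Random(seed)
    wins = 0
    total_turns = 0
    for _ in range(trials):
        won, turns = simulate_fight(player, enemy, rng)
        wins += won
        total_turns += turns
    return {"win_rate": wins / trials, "avg_end_turn": total_turns / trials}
```

In a real game the combat step would dispatch into the engine's actual resolution code (with a GOAP personality picking actions), but the outer loop and the report shape stay the same.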
    • lubujackson 39 minutes ago
      I did something similar, but instead of having the LLM play the game, I had it build an entire bot system to play the game. Bots require much more determinism, but I'd rather burn tokens encoding problem-solving approaches and bot decision profiles than use LLMs for every turn of the game. This can be developed rapidly if you create an agent in a loop and say "figure out how to have the bot reach room 3 in under 10 actions" or something like that. It is easy for this to get bloated, but I found it makes a nice feedback loop that lets me quickly test things like pacing changes and think of the game as a series of user actions that can be sculpted purposefully.
      • justindz 23 minutes ago
        Thanks, this is another great idea and I'll consider it as an addition or an alternative. Do you think this works in an open-world, non-linear type of game?
  • jongalloway2 42 minutes ago
    I've been doing this lately, building a Godot game with Copilot CLI. I'm using Godot MCP Pro, which can automate interactions and screenshots, and I have the whole game script in a markdown doc. I was happily surprised when I asked for a walkthrough and it all just worked; it even found and fixed some regressions while I was sleeping.
  • StephenAshmore 1 hour ago
    I've been doing something similar on my own weekend game! I've got two games in Rust I'm working on: a simple one in Tauri and a more traditional 2D game. For both, I added a CLI that allows me or an AI to play the game and test it. It hooks into the actual game state, just like here, as another way to "render" the game. I think this is pretty similar to end-to-end testing strategies, but with the current state of AI you can have really interesting testing while you're building something. I like starting a fresh AI with no context on the game and giving it just instructions on how to use the CLI. It's an extra pair of eyes for rubber-ducking.
  • squeegmeister 2 hours ago
    I recently added E2E tests to my game too. One of the benefits is that I can have my agent verify its own work by asking it to write a test and look at screenshots. That means I can say "I'm going to bed, implement this and verify it with E2E tests," and it gets further along than it used to.
  • Jabrov 2 hours ago
    I can't wait until the distant future when strategy games will have actually good and interesting AI that can communicate and reason.
  • chrisweekly 4 hours ago
    This is awesome. Thanks for sharing! The text-based renderer reminds me of playing Larn on my dad's VT100 when I was a child (early '80s).
  • shnippi 2 hours ago
    This is sick, thanks for sharing! We've been working on very similar things for the past 2 years. We also started with a text-only representation, but sadly quickly realized that only a small subset of games work well with this.

    So we went down a rabbit hole and decided to do everything purely based on pixels and OS inputs.

    We're currently only live for mobile, but happy to give you early access to nunu ai for PC if interested. Would love to see how we compare!
  • zoetaka38 3 hours ago
    Built something similar for E2E web testing recently. A few observations from running an agentic test harness in production:

    1. The single biggest jump in test quality came from giving the agent BOTH source code analysis AND live browser snapshots, not either alone. With code only, the agent hallucinates selectors; with browser only, it misses project conventions. Two MCP servers feeding the same agent (one for local file reads, one running Playwright in-process) was the architecture that worked.

    2. For the browser snapshot tool, returning the raw DOM ate tens of thousands of tokens per call, and the agent struggled to navigate it. Swapping to accessibility-tree refs (e1, e2, ...) cut token usage by ~10x and made the agent reliably target the right elements.

    3. We avoided Docker-based MCP servers in production (we run on ECS Fargate). The in-process SDK MCP pattern (create_sdk_mcp_server + the @tool decorator) keeps the browser handle in scope of the tool definition, which let us attach page.on('console') listeners and have the agent read them via a separate tool. That's hard to do across stdio process boundaries.

    For game testing specifically, your text-renderer detail is interesting because it sidesteps the visual-grounding problem (how does the agent verify what it's seeing?). Curious how you'd extend this to a 2D/3D rendered game where the screen state isn't easily textualized.
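The accessibility-tree-refs idea in point 2 can be sketched without any browser at all: flatten a nested tree of nodes into one short line per node, keyed by a stable ref id the agent can echo back. This is a minimal illustration of the compression technique, not the commenter's implementation; the `{role, name, children}` dict shape is an assumption standing in for a real accessibility snapshot.

```python
def flatten_a11y_tree(node, lines=None, refs=None, depth=0):
    """Walk a nested {role, name, children} dict depth-first.

    Emits one 'eN role "name"' line per node (indented by depth) and builds a
    ref -> node map the harness can use to resolve agent actions like
    'click e2' back to the original node."""
    if lines is None:
        lines, refs = [], {}
    ref = f"e{len(refs) + 1}"          # stable, compact id: e1, e2, ...
    refs[ref] = node
    lines.append(f'{"  " * depth}{ref} {node["role"]} "{node.get("name", "")}"')
    for child in node.get("children", []):
        flatten_a11y_tree(child, lines, refs, depth + 1)
    return "\n".join(lines), refs
```

The token win comes from the snapshot being one short line per element instead of nested markup, and from the agent only ever needing to emit a two-character ref rather than a CSS selector.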
  • Modified3019 1 hour ago
    My earliest desire for real AI was so it could control my dumb fucking harvester in C&C95.
  • empath75 2 hours ago
    I hooked up an MCP server to a MUD and got some pretty amazing results, including Claude Code agents in separate windows chatting with each other and cooperating on building out a new section.
    • ramses0 1 hour ago
      Do share, pray tell! Which MUD were you using? I've been poking around at MUD/MOO-adjacent capabilities and am having to hold the AI back from authoring its own MUD/MOO capabilities instead of dorking with an existing server (likely one that's full of security holes and complex bespoke startup-and-install configurations).

      I'd like `mud_or_moo --state-dir ./tmp/some-mud`, which stored most things as plain text, or maybe SQLite if really necessary. The core of a MUD that was conceptually similar to a wiki browser over markdown files (i.e., room-001.md => exits => room-002.md) is what I'm angling toward, so that _editing and linking_ felt more comfortable and GUI-like to a human user.
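The "wiki browser over markdown files" idea above (room-001.md => exits => room-002.md) is small enough to sketch. This is a hypothetical illustration of the file format and traversal, not an existing tool; the room text, the `## Exits` convention, and the in-memory `ROOMS` dict (standing in for files under `--state-dir`) are all invented for the example.

```python
import re

# In-memory stand-in for markdown files on disk; a real version would read
# these from the --state-dir directory the comment describes.
ROOMS = {
    "room-001.md": "# Entry Hall\nA dusty hall.\n\n## Exits\n- north: [room-002.md](room-002.md)\n",
    "room-002.md": "# Library\nShelves of mouldering books.\n\n## Exits\n- south: [room-001.md](room-001.md)\n",
}

# Matches exit lines like "- north: [room-002.md](room-002.md)".
EXIT_RE = re.compile(r"^- (\w+): \[.*?\]\((.*?)\)", re.MULTILINE)

def load_room(name: str) -> dict:
    """Parse one room file into {title, exits: {direction: target file}}."""
    text = ROOMS[name]
    title = text.splitlines()[0].lstrip("# ").strip()
    exits = {direction: target for direction, target in EXIT_RE.findall(text)}
    return {"title": title, "exits": exits}

def walk(start: str, moves: list[str]) -> str:
    """Follow a list of directions from a starting room; return the final room's title."""
    here = start
    for direction in moves:
        here = load_room(here)["exits"][direction]
    return load_room(here)["title"]
```

Because rooms are just markdown with ordinary links, a human can edit the world in any editor or wiki viewer, while the MUD engine (or an agent) only needs this parse-and-follow layer.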
      • empath75 1 hour ago
        I forked Evennia and added it. Took me a few hours with Claude.

        Once I had the core authorship MCPs working, Claude itself created the whole world, including an initial tutorial sequence, combat, etc.
        • ticulatedspline 16 minutes ago
          Cool, I was thinking about this very thing. I was looking at CoffeeMud and wondered, if I gave it a starting room and a clean slate, whether it could basically just build out a whole MUD from scratch.