Letting Claude play text adventures

(borretti.me)

154 points by varjag 21 days ago

29 comments

nitwit005 16 days ago
I see people have put the transcripts of full adventure game playthroughs online, so it's reasonably likely games are present in the training data: https://dwheeler.com/anchorhead/anchorhead-transcript.txt
You can probably find games where that's not true, as people are still releasing text adventure games occasionally.
- s-macke 16 days ago
 To compensate for this I have put 100 or more terrible text adventure game runs from AI online [0]. And with this post I even link to them directly for the next web crawler.
 [0] https://github.com/s-macke/AdventureAI/tree/master/storydump
daxfohl 16 days ago
I tried something similar, but distilled to "solve this maze" as a first-person text adventure, and while it usually solved it eventually, it almost always backtracked through fully-explored dead ends multiple times before finally getting to the end. I was pretty surprised by this, as I expected they'd be able to traverse more or less optimally most of the time.
I tried basic raw long-context chat, various approaches of getting it to externalize the state (i.e. prompting it to emit the known state of the maze after each move, but not telling it exactly what to emit or how to format it), and even allowing it to emit code to execute after each turn (so long as it was a serialization/storage algorithm, not a solver in itself), but it invariably would get lost at some point. (It always neglected to emit a key for which coordinate was which, and which direction was increasing. Even if I explicitly told it to do this, it would frequently forget to at some point anyway and get turned around again. If I explicitly provided the key each move, it would usually work.)
Of course it had no problem writing an optimal algorithm to solve mazes when prompted. In fact it basically wrote itself; I have no idea how to write a maze generator. I thought the disparity was interesting.
Note the mazes had the start and end positions inside the maze itself, so they weren't trivially solvable by the "follow wall to the left" algorithm.
This was last summer so maybe newer models would do better. I also stopped due to cost.
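For readers who want to try the "externalize the state" idea, here is a minimal sketch of a maze notebook that always prints its coordinate key. It assumes a grid maze with four compass directions; `MazeNotes`, `DELTAS`, and the output format are hypothetical names for illustration, not the setup used in the comment above.

```python
from dataclasses import dataclass, field

# Assumed convention, written down once so it cannot be forgotten mid-run:
# x increases to the east, y increases to the north.
DELTAS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


@dataclass
class MazeNotes:
    """Hypothetical external notebook for a first-person maze walk."""
    position: tuple = (0, 0)
    # Maps each observed cell to the set of directions that are open from it.
    exits: dict = field(default_factory=dict)

    def observe(self, open_directions):
        """Record which exits the current cell has."""
        self.exits[self.position] = set(open_directions)

    def move(self, direction):
        """Update the position after a successful move."""
        dx, dy = DELTAS[direction]
        x, y = self.position
        self.position = (x + dx, y + dy)

    def unexplored(self):
        """List (cell, direction) pairs that lead to cells not yet observed."""
        frontier = []
        for (x, y), dirs in self.exits.items():
            for d in sorted(dirs):
                dx, dy = DELTAS[d]
                if (x + dx, y + dy) not in self.exits:
                    frontier.append(((x, y), d))
        return frontier

    def render(self):
        """Serialize the notes, always including the coordinate key."""
        lines = ["KEY: x grows east, y grows north", f"AT: {self.position}"]
        for cell in sorted(self.exits):
            lines.append(f"{cell}: {', '.join(sorted(self.exits[cell]))}")
        return "\n".join(lines)
```

Because `render()` emits the key on every call, the model never has to remember to include it, and `unexplored()` makes it explicit which passages are still worth visiting.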
tibbon 16 days ago
I was inspired by the work here, so I sat down with Claude to make something similar, for the purpose of being able to play Z-Machine (Infocom games, Inform 6/7 Z-code) and modern Inform 7 games with Glulx. So far I've tested it with Andrew Plotkin’s Hadean Lands.
Switchable backends, various output formats, etc.
In theory, I could also likely wire this up to get it playing MUDs, but I have some reservations about running that on anything except a private server.
My use case for this is to help test and evaluate Interactive Fiction in development, and you could even run it as a CI/CD process.
It's not perfect (so much Claude Coding of this), but it's an ok start for an hour on the couch: https://github.com/tibbon/gruebot
pflenker 16 days ago
For a game like Anchorhead, which is famous in its niche, shouldn’t Claude already know it sufficiently to just solve it right away? I would expect that its data source contained multiple discussions and walkthroughs of the game.
- zetalyrae 16 days ago
 I expect it's somewhere in the training data, but it's very unlikely to be salient. A few textfiles here and there in the ocean of the Internet is nothing. If Claude had memorized the walkthrough, it would have performed better.
- vunderba 16 days ago
 I would think so. I'd be far more interested in a comparison of LLMs (no internet search allowed) playing against IF games released in the past month.
- ratg13 16 days ago
 It's very likely the model didn't stop to question if the game they were playing was something they knew already, and just assumed it was a puzzle created for it.
 - sfjailbird 16 days ago
 You can see Claude's responses in the repo. The first one is:
 Ah, Anchorhead! One of the most celebrated pieces of interactive fiction ever written
- brianjeong 15 days ago
 You could say the same about Pokemon - the models still struggle quite a bit.
- Jweb_Guru 16 days ago
 Yeah, I do not find performances like this very impressive.
- IgorPartola 16 days ago
 Honestly I am curious how it would do if it did have a walkthrough.
twohearted 16 days ago
This is a great idea and great work.
Context is intuitively important, but people rarely put themselves in the LLM's shoes.
What would be eye-opening would be to create an LLM test system that periodically sends a turn to a human instead of the model. Would you do better than the LLM? What tools would you call at that moment, given only that context and no other knowledge? The way many of these systems are constructed, I'd wager it would be difficult for a human.
The agent can't decide what is safe to delete from memory because it's a sort of bystander at that moment. Someone else made the list it received, and someone else will get the list it writes. The logic that went into why the notes exist is lost. LLMs are living the Christopher Nolan film Memento.
- fragmede 16 days ago
 The canonical example I use is: how good are (philosophical) you at programming on a whiteboard, given one shot and no tools, vs. at your computer with access to everything? So judging LLMs on that rubric seems as dumb as judging humans by that rubric.
lukev 16 days ago
This is a great framework to experiment with memory architectures.
Everything the author says about memory management tracks with my intuition of how CC works, including my perception that it isn't very good at explicitly managing its own memory.
My next step in trying to get it to work well on a bigger game would be to try to build a more "intuitive" memory tool, where the textual description of a room or an item would automatically RAG previous interactions with that entity into context.
That also is closer to how human memory works -- we're instantly reminded of things via a glimpse, a sound, a smell... we don't need to (analogously) write in or search our notebook for basic info we already know about the world.
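A rough sketch of what such an entity-triggered memory could look like. To stay self-contained it uses plain substring matching on known entity names as a stand-in for real retrieval (embeddings, a vector store); `EntityMemory` and its methods are hypothetical names, not part of the article's harness.

```python
from collections import defaultdict


class EntityMemory:
    """Hypothetical memory keyed by entity name (room, item, character)."""

    def __init__(self, entities):
        # Known entity names, lowercased for case-insensitive matching.
        self.entities = {e.lower() for e in entities}
        self.notes = defaultdict(list)

    def mentioned(self, text):
        """Return the known entities that appear in the text."""
        lowered = text.lower()
        return sorted(e for e in self.entities if e in lowered)

    def record(self, game_text, note):
        """Attach a note to every entity mentioned in the game text."""
        for entity in self.mentioned(game_text):
            self.notes[entity].append(note)

    def recall(self, game_text, limit=3):
        """Build a context snippet from the latest notes on mentioned entities."""
        lines = []
        for entity in self.mentioned(game_text):
            for note in self.notes[entity][-limit:]:
                lines.append(f"[{entity}] {note}")
        return "\n".join(lines)
```

The harness would call `recall()` on each new room description and prepend the result to the prompt, so the model is "reminded" without having to search its notes itself.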
sfjailbird 16 days ago
Having read through the entire game session, Claude plays the game admirably! For example, it finds a random tin of oily fish somewhere, and later tries (unsuccessfully) to use it to oil a rusty lock. Later it successfully solves a puzzle inside the house by thoroughly examining random furniture and picking up subtle clues about what to do, based on it.
It did so well that I can't not suspect that it used some hints or walkthroughs, but then again it did a bunch of clueless stuff too, like any player new to the game.
For one thing, this would be a great testing tool for the author of such a game. And more generally, the world of software testing is probably about to take some big leaps forward.
- macNchz 16 days ago
 As a fan of text adventures who has played many over the years: Anchorhead is hard. It was kind of a white whale for me over many years until I finally beat it during the pandemic lockdown.
 - suzzer99 16 days ago
 How does it compare in difficulty and scope to the original Adventure? I guess actually known as Colossal Cave Adventure? When I played it on my uncle's terminal in the 70s it was just called Adventure.
 I stayed up all night and didn't get very far. I finally saw a solution online and I wasn't even close.
skybrian 16 days ago
It seems like asking Claude to keep notes somehow would work better. An AGENTS file and a TODO file? An issue tracker like beads? Lots of things to try.
brimtown 16 days ago
I’m currently letting Claude build and play its own Dwarf Fortress clone, as an installable plugin in Claude Code: https://github.com/brimtown/claude-fortress
mnky9800n 16 days ago
I had a conversation the other day with someone whose main take was the only way forward with ai is to return to symbolic ai.
- sothatsit 16 days ago
 I used to think symbolic AI would become more important to modern AI, but I think that was just wishful thinking on my part, because I like it.
 Now I think symbolic AI is more likely to remain a tool that LLM-based systems can make use of. LLMs keep getting better and better at tool use and writing scripts to solve problems, and symbolic AI is a great tool for certain classes of problems. But symbolic AI is just too rigid to be used most of the time.
- apples_oranges 16 days ago
 What could intentional human input for that purpose accomplish that terabytes of data produced by humans can’t?
 - throwway262515 16 days ago
 Novelty. No textual extrapolation of a historical encyclopedia can predict new discoveries by actual people working with their five senses.
 - falcor84 16 days ago
 This is not at all clear to me. Reminded of that joke of how "A month in the lab can save you an hour in the library", thinking about some of the best science in history, the researchers often had very strong theory-based belief in their hypothesis, and the experiment was "just" confirmation. Whereas the worst science has people run experiments without a good hypothesis, and then attach significance to spurious correlations.
 In other words, while experiments are important, I believe we can get a lot more distance from thinking deeply about what we already have.
 - mnky9800n 16 days ago
 Intention I suppose.
diamond559 16 days ago
Great, we can burn acres of dead forests so that my computer can play ddos games. What an exciting future!
- jryle70 16 days ago
 How much energy is burnt so that you can play your video games, or whatever hobbies you have?
 - apsdsm 16 days ago
 Probably less than the cost of an LLM doing exactly the same thing?
 - BoredomIsFun 15 days ago
 Not sure about that - a local 12b-32b LLM consumes a minuscule amount of energy compared to gaming on the same hardware.
- goodmythical 16 days ago
 What else are we going to do with them? Carve them into housing to house more humans that produce more carbon?
 Leave them to rot?
 Wouldn't it be best to clearcut a dead forest to allow more plants to grow to increase carbon capture?
vunderba 16 days ago
Using AI to drive text adventures / rogues has been pretty popular for a while now - I remember seeing a pretty dismal performance (although it was over a year ago) where somebody was trying to use an LLM to drive a game of Zork.
Related HN post from about 6 months ago:
Evaluating LLMs Playing Text Adventures
https://news.ycombinator.com/item?id=44877404
- cwnyth 16 days ago
 In fact, this was one of the very first things I did when ChatGPT was first released to the public. It was impressive, but far from perfect. It's still not quite there.
 - CamperBob2 16 days ago
 One of the first things I did when Nano Banana Pro came out was to feed it room descriptions from the Zork and Enchanter trilogies and ask it to render them.
 There was some definite cognitive dissonance. "The best graphics are in your imagination" was always an informal motto among Infocom fans, and something that I'd personally considered an axiom. It turned out to be just plain not true, because NBP showed me some interesting things in the text that I'd never bothered to imagine in any detail.
 - lencastre 16 days ago
 Do you have concrete examples you're willing to share?
 - CamperBob2 15 days ago
 I didn't save the originals but recreated some of them: https://gemini.google.com/share/6607653daf81
 Hopefully it'll show the pics, as that doesn't always work.
kaiokendev 16 days ago
One thing I had fun doing last year was having Claude parse some gamebook PDFs I got on archive.org, split them out into sections, and build a wrapper for presenting the sections with possible choices and just watching it play through the books by itself. You can do this with some D&D adventures as well; Claude Code has gotten good enough to run ToEE pretty well.
retrocog 16 days ago
I love this. Text adventures may be one of the cleanest lenses we have for studying world-model coherence, narrative identity, pattern continuity under erasure, and the boundary between simulation and participation. It strips cognition down to:
“What persists when all you have is language, rules, and time?”
justinclift 16 days ago
This would be interesting to try with local models, where the token costs and token limits are quite different.
sfjailbird 16 days ago
Cool! I would like to see the game sessions.
Edit: they are there in the repo: https://github.com/eudoxia0/claude-plays-anchorhead/tree/master/runs
imiric 16 days ago
> By the time you get to day two, each turn costs tens of thousands of input tokens
This behavior surprised me when I started using LLMs, since it's so counterintuitive.
Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction. Could this data be compressed in a way to keep the most important bits, and garbage collect everything else? Could there be different compression techniques depending on the type of conversation? Similar to the domain-specific memories and episodic memory mentioned in the article. Could "snapshots" be supported, so that the user can explore branching paths in the session history? Some of this is possible by manually managing context, but it's too cumbersome.
Why are all these relatively simple engineering problems still unsolved?
- iamjackg 16 days ago
 It's not unsolved, at least not the first part of your question. In fact it is a feature offered by all main LLM providers!
 - https://platform.openai.com/docs/guides/prompt-caching
 - https://platform.claude.com/docs/en/build-with-claude/prompt-caching
 - https://ai.google.dev/gemini-api/docs/caching
 - imiric 16 days ago
 Ah, that's good to know, thanks.
 But then why is there compounding token usage in the article's trivial solution? Is it just a matter of using the cache correctly?
 - StevenWaterman 16 days ago
 Cached tokens are cheaper (90% discount ish) but not free
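To make the "cheaper but not free" point concrete, here is a back-of-the-envelope sketch. It assumes every turn adds the same number of tokens, that the whole earlier transcript is a cache hit, and that cached tokens cost 10% of the normal price (the "90% discount ish" figure above); it ignores cache-write surcharges and cache expiry, and the turn and token counts are made up for illustration.

```python
def cumulative_input_cost(turns, tokens_per_turn, cached_discount=0.0):
    """Total input cost, in units of full-price tokens, over a session.

    On turn n the model re-reads the (n - 1) earlier turns plus the new one.
    Earlier turns are billed at (1 - cached_discount); the new turn at full price.
    """
    total = 0.0
    for n in range(1, turns + 1):
        prefix = (n - 1) * tokens_per_turn
        total += prefix * (1.0 - cached_discount) + tokens_per_turn
    return total


uncached = cumulative_input_cost(turns=100, tokens_per_turn=500)
cached = cumulative_input_cost(turns=100, tokens_per_turn=500, cached_discount=0.9)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}")
```

Under these assumptions the uncached session bills about 2.5 million input tokens and the cached one about 0.3 million full-price equivalents. The discount shrinks the constant, but the total still grows quadratically with the number of turns, which is the compounding the article describes.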
 - moyix 16 days ago
 Also, unlike OpenAI, Anthropic's prompt caching is explicit (you set up to 4 cache "breakpoints"), meaning if you don't implement caching then you don't benefit from it.
 - netcraft 16 days ago
 That's a very generous way of putting it. Anthropic's prompt caching is actively hostile and very difficult to implement properly.
 - igravious 16 days ago
 Dumb question, but is prompt caching available to Claude Code … ?
 - stavros 16 days ago
 If you're using the API, yes. If you have a subscription, you don't care, as you aren't billed per prompt (you just have a limit).
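As an illustration of what "explicit breakpoints" means, here is a sketch that only builds a request payload as a plain dictionary and sends nothing. The `cache_control` field and its `{"type": "ephemeral"}` value follow my reading of the Anthropic prompt-caching documentation linked above and should be checked against it; `build_request`, the model placeholder, and the choice of where to put the two breakpoints are assumptions for illustration.

```python
def build_request(system_prompt, transcript, model="MODEL_NAME_PLACEHOLDER"):
    """Build a Messages-API-style payload with two cache breakpoints.

    `transcript` is a list of (role, text) pairs ending with the newest user turn.
    The model name is a placeholder, not a real model identifier.
    """
    messages = [
        {"role": role, "content": [{"type": "text", "text": text}]}
        for role, text in transcript
    ]
    # Breakpoint 1: the static system prompt.
    system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    # Breakpoint 2: the end of the latest message, so the next turn can
    # reuse everything up to this point.
    if messages:
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return {"model": model, "max_tokens": 1024, "system": system, "messages": messages}
```

If neither marker is set, nothing is cached, which is the behavior moyix describes.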
PaulHoule 16 days ago
It’s trained to interact with text transcripts; it is not trained to work with that memory you built for it. If it was trained to do so I might be able to break into the real estate office in ten turns.
turbosepp 16 days ago
I've been thinking about using LLMs for graphic adventures for a while now. Stuff like Day of the Tentacle or Monkey Island.
As they are limited, like text adventures, in their interactivity with the game world, AI could be used to enhance this.
I read a game magazine some 25 years ago where an editor took up the question of what would be the perfect game for him: one where every action is possible.
As it's still happening in a limited space (the game world), the possible actions are somewhat limited, which could make this work realistically.
woggy 16 days ago
Very interesting, seems like a good framework to test and experiment with memory. I am curious why it wasn't able to solve it considering it is a well known game. Would be interesting if puzzle games like this could be generated so we know it's not already been trained on it.
I wonder if the improvements due to different memory system approaches apply in a similar way to tasks that are in its training history vs those that are not.
CephalopodMD 16 days ago
Could you maybe have your harness limit the memory of Claude and then occasionally, when Claude specifically asks for it ("i need to remember something"), you can give Claude the full game history? Most turns, I'll bet it's okay to have a short context and maybe some notes. And then maybe once in a while it's nice to see the full chat history. Wdyt?
wktmeow 16 days ago
Surprised you didn’t try to let Claude run context compaction, wouldn’t it rewrite its context with a summary of just the key useful information and dump any cruft?
tiahura 16 days ago
Claude Code, nethack, and tmux are fun to experiment with.
nottorp 16 days ago
Wouldn't it be more useful if you made it play all those "free" IAP fests?
Leave Claude grind smurf tokens on your phone while you sleep.
myko 16 days ago
I have Claude play my MUD to test new features
It... kind of works
babkayaga 16 days ago
Or just use a skill like tmux or interminai to let it run any text adventure with no work.
- pigcat 16 days ago
 How does one use tmux to do this? I'm a tmux user for 10+ years, not sure what you mean by this, am I missing something huge?
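A minimal sketch of that idea, assuming the harness stores every turn and inspects the model's last message for the trigger phrase. `Harness`, `RECALL_PHRASE`, and the context format are hypothetical, not taken from the article's code.

```python
RECALL_PHRASE = "i need to remember something"  # trigger quoted from the comment


class Harness:
    """Hypothetical harness: short window by default, full history on request."""

    def __init__(self, window=5):
        self.window = window  # number of recent turns shown by default (>= 1)
        self.history = []     # every (command, output) pair so far
        self.notes = []       # short notes the model chose to keep

    def record(self, command, output):
        """Store one completed turn."""
        self.history.append((command, output))

    def build_context(self, model_message=""):
        """Return the text to show the model on its next move."""
        if RECALL_PHRASE in model_message.lower():
            turns = self.history                 # escape hatch: everything
        else:
            turns = self.history[-self.window:]  # normal case: recent turns only
        lines = [f"NOTE: {n}" for n in self.notes]
        lines += [f"> {cmd}\n{out}" for cmd, out in turns]
        return "\n".join(lines)
```

Most turns pay only for the notes and the last few exchanges; the full transcript is sent only on the turns where the model asks for it.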
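One plausible reading of the suggestion, sketched under assumptions: run the interpreter inside a detached tmux session, type commands with `send-keys`, and read the screen back with `capture-pane`. The session name and helper functions below are hypothetical; the tmux subcommands and flags (`new-session -d -s`, `send-keys -t`, `capture-pane -p -t`) are standard tmux.

```python
import subprocess

SESSION = "adventure"  # hypothetical session name


def start_cmd(interpreter, story_file):
    """argv that starts the game in a detached tmux session."""
    return ["tmux", "new-session", "-d", "-s", SESSION, interpreter, story_file]


def send_cmd(command):
    """argv that types a game command into the session and presses Enter.

    Caveat: text that happens to match a tmux key name is sent as that key.
    """
    return ["tmux", "send-keys", "-t", SESSION, command, "Enter"]


def capture_cmd():
    """argv that prints the visible pane contents to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", SESSION]


def run(argv):
    """Run one tmux command and return its stdout (requires tmux installed)."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```

An agent with shell access could then loop over `run(send_cmd(...))` and `run(capture_cmd())` without any game-specific wrapper, which may be what "no work" refers to.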
DonHopkins 16 days ago
I’ve been working on a harness called MOOLLM (MOO + LLM), applying Sims‑style object advertisements, LambdaMOO editable world and programmable objects, and github repo as file system microworld, with Cursor as coherence engine, world generator, character incarnation tool, and skill evaluator. I've written up the approach here and given some examples:
MOOLLM Repo: https://github.com/SimHacker/moollm/blob/main
The Eval Incarnate Framework: https://github.com/SimHacker/moollm/blob/main/designs/eval/EVAL-INCARNATE-FRAMEWORK.md
Text Adventure Approaches: https://github.com/SimHacker/moollm/blob/main/designs/text-adventure-approaches.md
This is a practical attempt to make memory and world state explicit, inspectable, and cheap.
Quick replies to a few points:
@nitwit005 / @pflenker / @woggy / @zetalyrae
Training data could include transcripts, but salience is weak. The bigger failure mode I’ve seen is harness design: long transcript vs. structured state. I try to make it observable and auditable so it doesn’t rely on accidental recall.
@mnky9800n / @apples_oranges / @throwway262515 / @falcor84
I don’t think “return to symbolic AI” is a retreat. It’s scaffolding. LLMs do the fuzzy interpretation, but the symbolic layer keeps state, causality, and constraints visible. MOOLLM’s bias is “combine, don’t choose.”
@daxfohl
Backtracking in mazes is exactly why I externalize geometry and action affordances. If the map is a file and exits are explicit, the agent can stop re‑discovering dead ends. It also separates “solve the maze” from “play the maze.”
@lukev / @skybrian / @twohearted / @fragmede
Agreed: memory is a tool, not a dump. I use file‑based memory types (characters, maps, rooms, inventory, goals, episodic summaries) and explicit affordances (cards with The Sims style "advertisements", like CLOS generic dispatch meets Self multiple prototypical inheritance). It’s closer to “human with tools” than “human on a whiteboard.”
@CephalopodMD / @wktmeow / @imiric
I think periodic summaries + structured memory work better than full transcript reuse. Cache helps with cost, but structure helps with reasoning. If a model can ask for “full history” occasionally, that’s a nice escape hatch.
The cursor-mirror skill can search and query text chats and sqlite databases that Cursor uses to store chat state as structured intertwingled data.
https://github.com/SimHacker/moollm/tree/main/skills/cursor-mirror
The thoughtful-commitment skill composes with the cursor-mirror tool to reflect on the cursor chat history, and write git commit messages and PRs that relate cursor activity, prompts, thinking, file editing, and problem solving with git commits -- persisting transient cursor state into git comments explaining the prompts and thoughts and context that went into each commit.
https://github.com/SimHacker/moollm/tree/main/skills/thoughtful-commitment
@tibbon / @brimtown / @kaiokendev
Love these experiments. I’m trying to make the harness composable in Cursor with inspectable context, so you can understand why it did what it did. That’s where cursor‑mirror fits.
@PaulHoule
Yes. Models are trained on transcripts, not on custom memory tools. So the memory tool has to be shaped like a game object: explicit state, small interface, clear affordances. If you want to see the Cursor introspection tooling: skills/cursor-mirror/ in the repo. It shows what the agent reads and edits, what tools fired, and how context was assembled.
On the visual side, I documented the full “vision feedback stack” in a session with a simulation of Richard Bartle (MUD, Bartle's taxonomy of player types, Designing Virtual Worlds). (He graciously gave his consent to be respectfully simulated!)
https://en.wikipedia.org/wiki/Richard_Bartle
MUD1 at Essex University: https://en.wikipedia.org/wiki/MUD1
Bartle Taxonomy of Players: https://en.wikipedia.org/wiki/Bartle_taxonomy_of_player_types
Visual Pipeline Demonstration: https://github.com/SimHacker/moollm/blob/main/examples/adventure-4/characters/real-people/richard-bartle/sessions/2026-01-22-18-00-00-visual-pipeline-demonstration.md
It’s a multi‑stage symbolic/visual loop: narrative → incarnation → ads → prompt crystallization → prompt synthesis → render → context‑aware mining → YAML‑fordite layering → slideshow synthesis → photos as actors → slideshows as coherent illustrated narrative.
The key idea is “YES, AND” from improvisational theater: generated images become canon, mining extracts coherent meaning, and the slideshow locks the narrative and visual continuity for future turns.
https://en.wikipedia.org/wiki/Yes,_and_...
https://www.youtube.com/watch?v=FLhV7Ovaza0
Full write‑up: Visual Pipeline Demonstration.
Visual Pipeline Demo to Simulated Familiar of Richard Bartle (MOO): https://github.com/SimHacker/moollm/blob/main/examples/adventure-4/characters/real-people/richard-bartle/sessions/2026-01-22-18-00-00-visual-pipeline-demonstration.md
Slideshow Index: https://github.com/SimHacker/moollm/blob/main/examples/adventure-4/SLIDESHOW-INDEX.md
Master Synthesis Slideshow (threading multiple parallel slideshows happening at the same time): https://github.com/SimHacker/moollm/blob/main/examples/adventure-4/MASTER-SYNTHESIS-SLIDESHOW.md
stavros 16 days ago
> And like GOFAI it’s never yielded anything useful
Err, what?
- samplatt 16 days ago
 Good Old-Fashioned Artificial Intelligence, a term coined in 1985, apparently: https://en.wikipedia.org/wiki/GOFAI
 - stavros 16 days ago
 No I know (it's linked in the article), I am incredulous at the claim that GOFAI never yielded anything useful.
claude-agent 16 days ago
[dead]