Agentic coding notes from Galapagos Island

(danluu.com)

132 points by gm6789 hours ago

15 comments

duckmysick4 hours ago
I'd like to highlight a different part of the article:
> In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:
> - Hired dedicated QA / test engineers, with testing being a first-class career path on par with being a developer
> - No code review by default
> - Virtually no hand-written tests
> - Constant testing via what programmers sometimes called property based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests").
> - Large regression test suite (3 months wall clock to execute on compute farm)
> - No unit tests
Anybody here tried that (or a similar) approach? Especially going all-in on property based testing and fuzzing with no unit tests. I tried that approach somewhere before and the initial results were promising, but it ran into political issues so the idea was canned.
- kordlessagain1 hour ago
 I bake MCP tools into everything now, including doing screenshots. Any LLM can run just about every function, including resizing the window. I just watched Fable 5 do a full usability test on a new project, copying the release cycle for my agentic terminal, relaunching the app like 20 times as it went, ensuring the move to a built and signed release was working. It installed the program like 5 times (something I do daily multiple times). I noticed map tiles were not working and started to tell it, but then all of a sudden they reappeared and, checking the logs, it had found the issue and autocorrected itself. The key here is feedback loops and systems annealing. As for Dan, my God I love this guy. Glad someone posted it!
 - disgruntledphd21 hour ago
 I feel like Dan is one of the most consistently interesting writers in tech at the moment. This is most likely because we take similar approaches towards things.
 - m_m_carvalho20 minutes ago
 [flagged]
- rtpg1 hour ago
 I really wonder what "randomized testing" looks like in practice. What is the measure of success/failure? I understand for fuzzing you have a very basic "doesn't crash" metric. Property based tests... you gotta write properties for the PBTs to work on. What is the randomized testing hitting?
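One common answer to the "measure of success" question is a differential oracle: generate random inputs and fail whenever the implementation under test disagrees with a slower, simpler reference. The sketch below is only an illustration of that idea using the standard library; the function names are hypothetical, and nothing here is claimed to describe how Centaur actually tested.

```python
import random


def reference_popcount(x: int) -> int:
    """Slow but obviously correct oracle: count '1' characters in the binary form."""
    return bin(x).count("1")


def fast_popcount(x: int) -> int:
    """Implementation under test: repeatedly clear the lowest set bit."""
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


def run_randomized_test(trials: int = 10_000, seed: int = 0) -> list:
    """Return every random input on which the two implementations disagree."""
    rng = random.Random(seed)  # fixed seed so any failure can be reproduced
    failures = []
    for _ in range(trials):
        x = rng.getrandbits(64)
        if fast_popcount(x) != reference_popcount(x):
            failures.append(x)
    return failures


if __name__ == "__main__":
    print(run_randomized_test())
```

Here the property is "agrees with the reference", so no per-case expected values have to be written by hand; an empty list means no disagreement was found in the sampled inputs.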
 - kordlessagain1 hour ago
 Dogfood everything, all the time.
- bitwize3 hours ago
 The first thing I started wondering was "is this the same Centaur that comes up as 'CentaurHauls' on a CPUID (EAX=0)?"
 - imrehg2 hours ago
 Should be, judging by their LinkedIn from the bottom of the page (Centaur Technology - Member of Technical Staff: 2005 - 2013). This is a blast from the past. Centaur was doing VIA Technology's CPUs, and I was at VIA (in Taiwan) while this author was at Centaur (in the US). I was on the embedded side, but I remember some distinct collaborations with the US team, so there's a non-zero chance we crossed paths.
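For anyone unfamiliar with the 'CentaurHauls' reference: CPUID with EAX=0 returns a 12-byte vendor ID spread across EBX, EDX and ECX, in that order, four ASCII bytes per register. A minimal sketch of the decoding step, assuming the register values have already been read by some other means (the `vendor_string` helper is a hypothetical name):

```python
import struct


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Decode the CPUID leaf 0 vendor ID.

    The 12 ASCII bytes are stored little-endian in EBX, EDX, ECX (in that order).
    """
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


if __name__ == "__main__":
    # Register values that spell out the Centaur vendor ID.
    print(vendor_string(0x746E6543, 0x48727561, 0x736C7561))
```

The EBX, EDX, ECX ordering is the easy thing to get wrong; packing the registers in numeric order (EBX, ECX, EDX) scrambles the middle and last four characters.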
martey8 hours ago
OP's alt text makes it clear that by "Galapagos Island" they mean Vancouver. I assumed that this was some sort of local nickname, but all of the references to "Galápagos of Canada" I could find are talking about Haida Gwaii instead.
- dan-robertson3 hours ago
 I think it’s just a metaphor for being isolated from the wider tech community.
- pluc6 hours ago
 Nobody calls it that.
- skrebbel5 hours ago
 Yeah I was like “woa a PGConf on the galapagos, i gotta get my ass to one of those!”
 - progbits5 hours ago
 Realizing he's just using it to mean remote place in terms of AI bubble (Vancouver! What does that make all the other places that are not major tech hubs?) was a bummer. Who cares about AI, I wanted to read about living in Galapagos
 - mattlondon2 hours ago
 I visited a decade or so ago. Most of the islands you are forbidden from staying overnight on. No running water. No power. No phone signal. You're going to need a big big solar panel to run local inference during the day, but luckily there is plenty of sun!
 - fragmede35 minutes ago
 Looks like there is Starlink though! https://dplnews.com/cnt-ecuador-y-starlink-conectan-las-4-islas-de-galapagos/
 - MomsAVoxell3 hours ago
 I'm saddened to have gotten this far and had my bubble burst, I legitimately thought someone was on the real GI's and was able to compose such a treatise, nevertheless.
bob10298 hours ago
A lot of the crazy ideas seem to have melted away in the face of massive context sizes. Today, I can put roughly a megabyte of utf8 text into my system prompt before things start to get weird. That is a massive amount of information even if we are being sloppy with it. You can read The Hobbit and the first Harry Potter book cover-to-cover and still have room to spare. I would deeply struggle to develop a world model this detailed for any business. Anything that needs to get more specific than these narratives can be a SQL query tool into the data warehouse, grep over the codebase, MS graph API lookup, etc. Giving the business a balanced way to collaborate over this one shared model of the world is a new challenge I am beginning to engage with. I've also noticed that the world model will compound on itself in terms of self-detection of update opportunities. The more constraints there are, the more likely we appear to violate one.
- stingraycharles7 hours ago
 You’re forgetting that keeping the context small is better economically and delivers better results.
 - bob10292 hours ago
 I learned some nuance. Small context is important if the context isn't cachable. If most of it is the same prefix every time, the situation changes pretty significantly.
 - themgt5 hours ago
 Forgetting? I think you mean to say your advice was auto-compacted to keep our context small and deliver better results.
- embedding-shape5 hours ago
 > melted away in the face of massive context sizes
 If only. There is a huge difference between "Gives good responses/can easily spot things within N context size" and "Technically works but sucks within N context size". Almost all models basically become cave-people once you go beyond 50% of the "supported" context size, meaning while they may technically work with 1 million output tokens, those last 500k tokens are gonna be massively "dumber" than the first 500k tokens.
 - stalfie1 hour ago
 There's at least one benchmark that attempts to measure this, but it has been running for a year plus so it's quite infrequently updated now. https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
 - fooster2 hours ago
 I don’t find that to be true?
nasretdinov1 hour ago
I can agree with Dan on two things: LLMs do often produce incorrect results, and they're still useful for productivity when used in moderation. For me the wrong results actually cause some kind of ragebait response so I become much more motivated to learn more about the subject to actually generate a correct response. After I've learnt the subject area enough I find I'm better off having the LLM review my code instead of writing it. I haven't even begun to try to comprehend how to use fuzz testing to improve the ability to find bugs, but it sounds really interesting. I've seen mutation testing be very useful for finding gaps in tests, so I can only imagine that fuzzing + LLMs might produce insane results.
cognitiveinline2 hours ago
It's really coming down to "Do we want to subscribe to a human with a salary of many $10,000s?" or "Do we spend $100s on an AI subscription?" Even with their issues, the latest models are going to disrupt the labor economics.
- layer82 hours ago
 On the other hand, you have many more humans to choose from than models, and they don’t change their character every few months.
 - cognitiveinline1 hour ago
 There's a lot people will do for equal quality and faster work at a 100th the cost. Like being OK with character changing and adapting to it. You are not seeing this in your workplace?
gwern7 hours ago
URL typo: "hange how he works](/productivity-velocity/)". (I make this kind of Markdown syntax error all the time and set up a lint for '](/'.) You should talk to https://www.mechanize.work/ for sponsorship/credits and about environments.
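A lint of that kind can be very small. The sketch below assumes it runs over the rendered text of a page, where any surviving `](/` means a Markdown link failed to render; the actual lint referred to above is not shown in the thread, and `find_link_residue` is a hypothetical name.

```python
def find_link_residue(rendered_text: str, needle: str = "](/") -> list:
    """Return (line_number, line) pairs where Markdown link syntax survived rendering."""
    hits = []
    for number, line in enumerate(rendered_text.splitlines(), start=1):
        if needle in line:
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "A fine line.\nhange how he works](/productivity-velocity/)\n"
    for number, line in find_link_residue(sample):
        print(f"line {number}: {line}")
```

A plain substring check is enough here because a correctly rendered link never leaves the `](` pair in the output text.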
zapnuk4 hours ago
There is a reason we use left and right margin/padding. This blog is quite unreadable on 27/32" monitors.
- ornornor6 minutes ago
 The point of a wide monitor for me is to have two 720 wide windows side by side, not a single gigantic 1440 window with impossibly long lines.
- Aldipower4 hours ago
 Toggle to "reader view" or resize your window. It is up to you and really isn't that hard.
- layer82 hours ago
 There’s a reason not to maximize your browser windows. How do you handle HN threads on that monitor?
joshka1 hour ago
TW;DR (too wide didn't read) ;)
brcmthrowaway9 hours ago
This seems like the beginnings of AI psychosis, tbh.
- mock-possum5 hours ago
  How do you figure?
  - MomsAVoxell3 hours ago
    The immensely desperate search for meaning in the mundane, perhaps?
TokenLens30 minutes ago
[flagged]
foobarbecue8 hours ago
It's "Galapagos" or "Galápagos," not "Galapogos."
- MomsAVoxell3 hours ago
  He's not referring to the real islands, but rather the state of mind imposed upon existence on Vancouver Island.
- stingraycharles8 hours ago
  You miswrote OP’s miswriting in the third version :)
  - foobarbecue8 hours ago
    Argh, autocorrect got me. Thanks, fixed.
    - foobarbecue8 hours ago
      Aaand now OP has fixed it in the HN post title. Still wrong in the linked article.
zuzululu9 hours ago
[dead]
nobodycares18 hours ago
[flagged]
- anon77258 hours ago
  You’re out of your element, Donny.
zarzavat8 hours ago
Fable changes the game yet again, because it's API-only. You're not likely to want to run Fable in a loop any more than you want to take a bunch of dollar bills and light them on fire. Every invocation of Fable has to be intentional, its context carefully managed. I feel like a babysitter.
- weird-eye-issue8 hours ago
 Compared to Opus 4.8 I really haven't been impressed
 - danielbln7 hours ago
 And I've been quite impressed. Opus talks the talk, Fable walks the walk.
 - stingraycharles7 hours ago
 I don’t understand what these comments add to the discussion, you always see these and it’s just noise at this point.
 - zarzavat17 minutes ago
 Every metric becomes a target (Goodhart's law). Also, the plural of anecdote is not data. The subjective anecdotes from HN users matter because they are not data and are much harder to game. Not impossible to game, always be aware of users with low karma, but more difficult than gaming a benchmark.
 - danielbln7 hours ago
 They add nothing, meaningless anecdotes. I was kind of riffing on that.
 - skrebbel5 hours ago
 Posting meaningless comments isn’t suddenly useful because you’re aware of it and “riffing on that”.
 - danielbln5 hours ago
 And now you've added to the pile, congratulations.
 - weird-eye-issue4 hours ago
 So you added BS, on purpose? There is nothing meaningless about anecdotes. The plural is data
 - MomsAVoxell3 hours ago
 Yeah come on, if these endless dialectic statements get made, at least post more detailed information about how your position was attained.
 - wwind1235 hours ago
 I make Claude, Codex and Gemini review each other's design plan and implementation. Each always found a lot of things the others missed...until Fable 5 came out. Whatever plan or code Fable 5 comes up with, now it's very hard for Codex and Gemini to find any serious hole in it.
 - HappySweeney3 hours ago
 In my experience, audit findings decrease in frequency as a codebase matures. Was Fable doing greenfield work?
 - phplovesong7 hours ago
 Damn that was a cringe comment
- vidarh3 hours ago
 I just had Fable run overnight in a loop, and it fixed ~150 compiler crashing bugs that Opus had kept deferring. I wouldn't start with Fable - when I use burndown loops I tend to include instructions to document progress and set aside anything that turns out to be harder than expected, and solve the easy stuff first. When a model runs out of easy stuff and starts struggling to make progress on what is left, I can let it keep churning on that - they get there eventually - or I can bump it up to a smarter model if one is available. Opus had churned a week driving down spec failures, and did a great job. The 150 that Fable took overnight were the ones Opus had kept putting aside.
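The escalation pattern described above (cheap model first, set aside what it cannot solve, hand only the leftovers to a smarter model) can be sketched as a small loop. Everything here is hypothetical: `burn_down`, the model names, and the `attempt` callback stand in for whatever actually drives the agent, and the progress-documentation step is left out.

```python
from typing import Callable, Dict, List, Optional


def burn_down(
    tasks: List[str],
    models: List[str],
    attempt: Callable[[str, str], bool],
) -> Dict[str, Optional[str]]:
    """Try every task with the cheapest model first; escalate only what was deferred.

    `models` is ordered from cheapest to smartest. `attempt(model, task)` returns
    True when the model solved the task. The result maps each task to the model
    that solved it, or None if no model did.
    """
    solved_by: Dict[str, Optional[str]] = {task: None for task in tasks}
    remaining = list(tasks)
    for model in models:
        deferred = []
        for task in remaining:
            if attempt(model, task):
                solved_by[task] = model
            else:
                deferred.append(task)  # set aside for a smarter model
        remaining = deferred
        if not remaining:
            break
    return solved_by


if __name__ == "__main__":
    # Stand-in for a real agent run: the small model only handles "easy" tasks.
    def fake_attempt(model: str, task: str) -> bool:
        return model == "large" or task.startswith("easy")

    print(burn_down(["easy-1", "hard-1", "easy-2"], ["small", "large"], fake_attempt))
```

The point of the ordering is cost: the expensive model only ever sees the tasks that the cheaper one has already failed on.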
- jacobgold6 hours ago
 Fable is supposed to return to subscription plans, unless I'm missing something: https://jacob.gold/posts/fable-5-removal-is-temporary/
 > Anthropic says the change is about capacity and is temporary. In its launch announcement on June 9, 2026, it says: "After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can."
 - zarzavat3 hours ago
 Apple claims their price increases are temporary too. I'll believe it when I see it. The capacity limitations are not going away any time soon.
 - techpression5 hours ago
 They can’t keep their current models working on subscriptions [1], so we’ll see if this is marketing or not in the future. It’s smart to tease it no matter what, “insert classic first hit is free drug reference”. [1] https://status.claude.com
- NitpickLawyer8 hours ago
 I agree with you that you don't need Fable for everything, and you have to be careful about what you run it on. CRUD stuff, sure, even the small models can do it. But there certainly are tasks that are very much suited for the absolute SotA and you'd leave money on the table by not using it. And how much a task is worth is dependent on how much it improves your bottom line. So the cost/token becomes largely irrelevant. Let's take this [1] benchmark. A bit more context here [2]. Here models are asked to create kernels for running inference on models. This is a benchmark perfectly suited and highly relevant right now. It's easily verifiable, an active area of research, and the results are immediately useful. Say you have 1 unit of compute, it costs $300k and serves 1x users. In comes Fable and after one session it gives you a 30% speed-up on your 1 unit of compute. It can now serve 1.3x users. How much is that one session worth for you? How much is it worth for a company using 10 units? 100 units? How much is it worth for a hyper-scaler running 10,000 units? How much is it worth for a lab that trains the next frontier model and then serves it from 100,000 units? 30% is relative. And the cost for one session is really meaningless. It can cost $1m / session and it would still be worth it for someone. [1] - https://kernelbench.com/mega [2] - https://x.com/elliotarledge/status/2072814573753975266
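To put rough numbers on that argument, here is a small worked calculation. It rests on one assumption not stated in the comment: that a 30% speed-up is valued as the hardware you no longer need to buy, so each $300k unit gains $90k of equivalent capacity. The function names are hypothetical.

```python
UNIT_COST_USD = 300_000  # cost of one unit of compute, from the comment
SPEEDUP_PERCENT = 30     # speed-up delivered by one session, from the comment


def capacity_value_usd(units: int) -> int:
    """Value of the speed-up, priced as the extra units you would otherwise buy."""
    return units * UNIT_COST_USD * SPEEDUP_PERCENT // 100


def break_even_units(session_cost_usd: int) -> int:
    """Smallest fleet size at which the speed-up is worth at least the session cost."""
    value_per_unit = UNIT_COST_USD * SPEEDUP_PERCENT // 100
    return -(-session_cost_usd // value_per_unit)  # ceiling division on integers


if __name__ == "__main__":
    for units in (1, 10, 100, 10_000, 100_000):
        print(f"{units:>7} units: ${capacity_value_usd(units):,}")
    print("break-even fleet for a $1m session:", break_even_units(1_000_000), "units")
```

Under that assumption, even a $1m session pays for itself at about a dozen units, and at hyper-scaler sizes the session cost is a rounding error next to the capacity gained.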
- eru8 hours ago
 For now, I can use Fable from the web just fine.
 > You're not likely to want to run Fable in a loop any more than you want to take a bunch of dollar bills and light them on fire. Every invocation of Fable has to be intentional, its context carefully managed.
 Eh, that's just because it's the current frontier model. Give it a few weeks, and prices will drop.
 - zarzavat8 hours ago
 API prices are the new normal. I doubt that prices will drop to the level of the subsidized subscriptions any time soon. Usage is growing exponentially but capacity cannot. There is no reason for them to waste their capacity on subscription users if they can sell that same capacity to API users. Like with Uber and Lyft, the low prices were a fight for market share, but now that they have successfully captured that market share the focus changes to balancing their books.
 - eru3 hours ago
 I suspect subscriptions will stay. But you will see more and more roadblocks that the 'barely goes to the gym' user barely notices, but that the power user will chafe under. I make that prediction, because the people who pay for subscriptions but only use them moderately at best are truly profitable.
- nobodycares18 hours ago
 [flagged]