LLMs are still next-token predictors. Just because you can give one vaguer instructions and it still finds the right steps to follow doesn't mean it's intelligent. It means you're speaking the same language as the harness they trained your model on.<p>And that has a limit. If you are stuck at PoC level or simple apps, you have no idea how limited the current models still are. Past that point you really need to break tasks down, not just trust a token predictor to list steps that sound good. There has to be a human in the loop somewhere, because once you start skipping permissions, the best case is you hit the jackpot; more likely you get a suboptimal solution and wasted tokens; and what's genuinely still terrifying is when the model ignores instructions and does some stupid nonsense, ruining your day. It really is as sharp as a CNC machine. It's not not useful, but it could be dangerous, so maybe don't try to carve wood with a monster machine, or park your Ferrari in that crammed neighbourhood if you don't know how to parallel park.
I thought this was how everyone who can actually code uses AI for anything that’s actually important.<p>Am I wrong? Are you guys just YOLOing everything these days?
>>You never use “YOLO” mode (aka “dangerously skip permissions”)<p>Do you mean this?<p>I'm curious how people are using Claude in any way other than bypass-permissions. I've tried for so long to maintain a curated list of things Claude can use, but inevitably I would come back only to find it stuck because it had decided to pipe the output of one tool into another, and that wasn't explicitly allowed, so it stopped even though it was just grepping or whatever. I found it infuriating. In bypass-permissions it "just works", but then again I only use it to analyze existing code and suggest new changes (and even if it breaks something, that's what source control is for?)
It does do this, and it is frustrating: it saves 30 tokens and then wastes a few thousand more when it didn't get all the context it needed by grepping. You have to be involved in the process though. It frequently wants to do things that are so incorrect that, even if it would be more convenient to just totally ignore it, it would be insane to actually ignore it. Do you trust it to not accidentally rm -rf the .git/ right after it helpfully force pushes to remote? I don't. Even if I don't expect it to do that, why would I ALLOW it to be able to?
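The two complaints above (a curated allowlist that stalls on harmless pipes, and not wanting the agent to be able to run destructive commands at all) can in principle be reconciled by checking each stage of a pipeline separately. A minimal Python sketch, assuming a made-up policy: the names `ALLOWED_PROGRAMS`, `ALLOWED_GIT_SUBCOMMANDS`, `split_pipeline`, and `is_allowed` are hypothetical and do not reflect how Claude or any other agent actually implements permissions.

```python
import shlex

# Hypothetical policy for illustration; not any real agent's config format.
ALLOWED_PROGRAMS = {"grep", "cat", "ls", "head", "wc", "git"}
# Git subcommands this sketch treats as safe to run unattended (an assumption).
ALLOWED_GIT_SUBCOMMANDS = {"status", "diff", "log", "show"}
# Substitution and multi-line input are rejected outright, even inside quotes.
FORBIDDEN_CHARS = "`$\n\r"
SHELL_PUNCTUATION = set("();<>|&")


def split_pipeline(command):
    """Split a command line into pipeline stages.

    Raises ValueError on any shell operator other than a single pipe
    (for example ';', '&&', '>' or '||') and on unbalanced quotes.
    Limitation: after quote removal, a quoted "|" looks like a real pipe,
    so such commands are split more often than the shell would split them.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    lexer.commenters = ""  # treat '#' as an ordinary character
    stages, current = [], []
    for token in lexer:
        if token and set(token) <= SHELL_PUNCTUATION:
            if token != "|":
                raise ValueError(f"operator not allowed: {token}")
            stages.append(current)
            current = []
        else:
            current.append(token)
    stages.append(current)
    return stages


def stage_allowed(argv):
    """Allow a stage only if its program (and git subcommand) is listed."""
    if not argv or argv[0] not in ALLOWED_PROGRAMS:
        return False
    if argv[0] == "git":
        return len(argv) > 1 and argv[1] in ALLOWED_GIT_SUBCOMMANDS
    return True


def is_allowed(command):
    """Return True only if every stage of the pipeline passes the policy."""
    if any(ch in command for ch in FORBIDDEN_CHARS):
        return False
    try:
        stages = split_pipeline(command)
    except ValueError:
        return False
    return all(stage_allowed(stage) for stage in stages)
```

With this policy, `grep -rn TODO src | head -n 5` passes without a prompt, while `rm -rf .git` and `git push --force` are refused. The sketch denies anything it does not understand, which is probably the right default for this kind of check, though a real shell has many more corner cases than it covers.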
I’ve found unexpected success in using ephemeral NixOS VMs for local development… once you authenticate your agent you can let it run wild without worrying about permissions.
It seems to me that this “short leash” you keep could just be your way of keeping engaged. It’s certainly not needed if you use good models, but I can appreciate it if it helps you to keep paying attention.<p>The biggest failure mode of AI use is stopping engaging with problems because “the AI is handling it”. As long as you avoid that, I think you’ll be fine.<p>That said, I much prefer having detailed discussions about a feature or idea, letting the AI off the leash to implement it, and then coming back to have a detailed review discussion. This seems to get a lot more out of better models that can have more nuanced discussions and write better implementations. The process of discussing designs and their implementations, questioning things that look weird to me, and actually reading the AI’s responses also helps me to get to better final solutions.
I find it hard to stay engaged doing this. I do get good results, but it's just hard to not get distracted when it's doing the work.
This post seems like some decent advice mixed in with a lot of overconfidence and unverifiable claims.<p>“expert developers whose skills have reached the point where they outclass any and all “frontier AI models” in their area of expertise”<p>Are any developers saying they outclass any and all frontier models? I’d say at best it’s mixed at this point. The best developers still do certain things better, but not even close to all things.<p>“The problem is that even code written and/or reviewed by Fable 5, will stink”<p>I’m skeptical. Example prompt and output please.
I <3 how everyone and their brother feels qualified to write advice to hundreds? thousands? of other developers about AI ... based on a couple months of experience as a personal user.<p>I mean, it's like writing a book about how to use React or Django or some other major software ... after you used it for one project for a month!<p>Authors: I know this is the Internet, and I know bloggers blog about whatever pops into their head ... but if you are going to act like an authority, how about you <i>learn more than the average reader</i> before you start telling them authoritatively what to do?
A lot of people with long careers in the old way of doing things are feeling incredibly threatened and defensive, and desperate to virtue signal about AI.
People are doing what they've always done with any other new technology, and sharing what, personally, works for them. People can take or leave the advice.
Right, but there's a marked difference between "I just tried this new tech and here's what I think" and "I've used this tech for a few months and now I'm going to speak like I know everything about it".<p>I have no beef with people writing about new tech, but I do have beef with claiming that "____ is the correct way to do it" ... based on nothing except "I feel proud of the last three months I spent with Claude".
How to get reliably useful and trustworthy outcomes from AI systems is an open problem of clearly large value in many domains; software is maybe the signal example of that. If one had solved it resoundingly and scalably, one could in fact "get rich quick".<p>It is unsurprising that a lot of people claim to know how to get rich quick.<p>I believe it is possible to solve this problem, and I have my own horses in the race which I won't threadjack to promote here, but it's the central problem of our profession at the moment. We've all seen the truly discontinuous outcomes and we've all seen allegedly national-security-dangerous models (which at one time was GPT-3) faceplant with their shoelaces tied together. I wanted to see if Fable was really all that and I left it overnight on some fairly straightforward C++ (code DSv4 Flash works on with moderate supervision) and it's pretty roast-worthy. I gave it a chance to redeem itself this morning and it's ticked up a bit (I still think it's roughly Opus 4.8 with a Project Zero fine tune and DRO trained off the constant gratuitous yield tic which is pretty clearly an intentional gimp).<p>I give all such claims 30 seconds of my time because someone is going to actually be right one of these days.
Seems hella inefficient.<p>A better method starts with realizing that everything every program does is data transformation and/or movement.<p>Then you ask the LLM to subdivide the data into a tree along the domain model, classifying nodes as streaming vs. storing.<p>Then for each node you discuss the best data structure with the AI.<p>Then you ask for an interface that fully encapsulates the structure, where every mutation can only go from a valid state to a valid state and nothing else is allowed to touch the state.<p>And that's mostly it: just connect all the interfaces until input goes to the monitor, to storage, to an API, or wherever the destination is.
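For what it's worth, the "valid state to valid state" interface described above might look something like the following. This is only a sketch under assumptions: `StockLevel`, its fields, and its invariant (0 <= reserved <= on_hand) are a hypothetical "storing" node invented for illustration, not something from the comment.

```python
class StockLevel:
    """Hypothetical 'storing' node with invariant 0 <= reserved <= on_hand.

    State is private; callers read it through properties and change it only
    through methods that check the invariant before mutating anything.
    """

    def __init__(self, on_hand=0):
        if on_hand < 0:
            raise ValueError("on_hand must be non-negative")
        self._on_hand = on_hand
        self._reserved = 0

    @property
    def on_hand(self):
        return self._on_hand

    @property
    def reserved(self):
        return self._reserved

    @property
    def available(self):
        return self._on_hand - self._reserved

    @staticmethod
    def _require_positive(qty):
        if qty <= 0:
            raise ValueError("quantity must be positive")

    def receive(self, qty):
        """Add newly arrived units to stock."""
        self._require_positive(qty)
        self._on_hand += qty

    def reserve(self, qty):
        """Set aside units for an order; never more than is available."""
        self._require_positive(qty)
        if qty > self.available:
            raise ValueError("not enough available stock")
        self._reserved += qty

    def ship(self, qty):
        """Ship previously reserved units, reducing both counters."""
        self._require_positive(qty)
        if qty > self._reserved:
            raise ValueError("cannot ship more than is reserved")
        self._reserved -= qty
        self._on_hand -= qty
```

Every method validates first and mutates second, so a rejected call leaves the object exactly as it was. Connecting nodes then means passing values between such interfaces rather than reaching into each other's fields.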
I'm curious whether Opus 4.8 or similar can attain Mythos level through good system prompting and steering. You would expect this to work if it's true that the strength of Mythos is its unwillingness to quit before it gets a desired outcome.
As a Mythos user (I’m part of Project Glasswing), I would say that abliterated models [1] produce similar, if not identical, results. While good prompting and steering won’t give Claude Opus 4.8 the same capabilities as Mythos (preview 1), using abliterated models (if you have the computational power to run the larger ones) will get you close to the same goals as people who have access to Mythos (preview 1) [2].<p>[1] <a href="https://huggingface.co/search/full-text?q=abliterated&type=model" rel="nofollow">https://huggingface.co/search/full-text?q=abliterated&type=m...</a><p>[2] I specifically refer to “preview 1” because the newer versions (Fable 5 / Mythos 5) don’t appear to offer the same level of freedom as the very first version that I was able to use through Project Glasswing. This is one of the reasons why I continue running our massive security scans with “preview 1”, or at least I was running them until June 30, when the program’s policy changed.
I think that Anthropic is gaslighting us with their new model releases. Specifically, I think they have some good base model and are just fine-tuning it until they achieve the desired outcome, or the desired outcome is achieved accidentally as part of fine-tuning. My theory is based on the fact that, as a long-term (if you can call it that) Claude user, I keep noticing the same patterns in its output. It's not trivial, but it's certainly possible to tell when something has been written by Claude, because it has a different style than GPT.<p>However, they have quite a good harness in their backend, which is the actual model.
There really wasn't much substance to this article.
It’s just parroting the current trope.<p>Last year it was, “AI is just a stochastic parrot.”<p>This year it’s, “AI can write the code, but a human still has to review it!” (Using AI, of course.)<p>Give it another year and the narrative will be: “Only AI is capable of reviewing code, and only AI can review the AI’s review. Humans just need to read the AI’s final opinion so they still have meaningful oversight.”<p>The goalposts keep moving. The certainty never does.
The regress ends somewhere, because (barring some pretty sharp changes to the way the law works basically everywhere) ultimately someone has to certify the outcomes as acceptable. This might be in the form of the market (though AI-adjacent stuff seems extremely prone to prolonged market failures), this might be regulatory in nature. This might be the executive management of the companies involved.<p>Personally I think that if you cranked the capability up high enough the first person you'd run into who absolutely demanded more than vibes and didn't care about your singularity thesis would be the representative of a reinsurance firm: mostly to do serious stuff without bending the law, you need insurance, and I am unaware of anyone writing serious policies (certainly not ones that make any economic sense) that underwrite the risk of AI autonomy outcomes financially.<p>When Swiss Re writes a policy that Anthropic Cinematic Universe or whatever iteration we're on won't fuck it up?<p>Now maybe we're talking. Until then you ask three practitioners and get nine answers, no one knows what they're talking about unless they're doing a really good job keeping it quiet (and that's probably what you'd do!).
This is probably slower than writing the code yourself. Doesn't make sense to me. Using an agent without YOLO mode is not worth it.<p>The way I'd rather do it is to tightly control the output with skills you write yourself, prompts, plans, etc., and get an outcome as close as possible to what you would write yourself.
[flagged]