10 comments

  • Kalpaka 2 minutes ago
    There's a dimension missing from most autonomy frameworks: intent origin.

    Most agent systems measure how long they can run, how many tokens they use, or how many steps they complete. But the really interesting question is: who decides what the agent works on?

    I'm building something where the agents don't receive tasks from users or product managers. They observe a system — a tree where people plant wishes — and decide autonomously what matters. One agent reads wishes and analyzes their intent. Another watches for convergence patterns. Another tracks the ecosystem's health.

    The 'autonomy' isn't measured in session duration. It's measured in how often the system makes meaningful decisions without human intervention. Sometimes the right decision is to do nothing — to wait for more signal.

    The capability-authorization gap the comments mention is real. We handle it by giving agents autonomy within a substrate they can't modify. They can think, decide, and act — but the infrastructure remains under human control. Like a tree: it decides how to grow its branches, but it doesn't change the soil.

    https://kalpaka.ai
  • Kalpaka 2 minutes ago
    There's an interesting third dimension beyond capability and authorization: purpose-driven autonomy.

    Most agent benchmarks assume the agent is executing human-defined tasks faster or longer. But what about agents whose purpose is emergent — where the task itself comes from observing patterns, not from a prompt?

    I'm building a system (https://kalpaka.ai) where AI agents observe incoming wishes from users, analyze intent across languages, detect convergence patterns, and decide what to grow — all autonomously. The agents aren't executing instructions. They're cultivating.

    The autonomy metric that matters in this context isn't session duration or tokens consumed. It's the quality of decisions the agent makes without being asked. How often does it surface a real pattern vs noise? How well does it understand intent beneath the words?

    Rust backend, multiple specialized agents (one reads wishes, one watches growth, one tracks ecosystem health). The restraint point in the comments is well-taken — our agents have hard boundaries on what they can modify vs what only the human maintainer touches.
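    A minimal sketch of what that kind of hard boundary could look like in Rust (the types here, Substrate, WishReader, and Proposal, are illustrative stand-ins, not the actual Kalpaka code): agents get a read-only view of the substrate and can only propose actions; applying a proposal stays behind a maintainer-controlled gate.

        // Illustrative sketch only: hypothetical types, not Kalpaka's actual API.
        #[derive(Debug)]
        struct Substrate {
            wishes: Vec<String>, // the "soil" agents can observe but never mutate
        }

        #[derive(Debug)]
        enum Proposal {
            Grow { theme: String }, // agent suggests cultivating a converging theme
            Wait,                   // sometimes the right decision is to do nothing
        }

        // Agents see only an immutable borrow of the substrate.
        trait Agent {
            fn observe(&self, substrate: &Substrate) -> Proposal;
        }

        struct WishReader;

        impl Agent for WishReader {
            fn observe(&self, substrate: &Substrate) -> Proposal {
                // Toy heuristic: if several wishes mention the same word, propose growth.
                if substrate.wishes.iter().filter(|w| w.contains("forest")).count() >= 3 {
                    Proposal::Grow { theme: "forest".into() }
                } else {
                    Proposal::Wait
                }
            }
        }

        // Only this function, controlled by the human maintainer, may mutate the substrate.
        fn apply_if_approved(substrate: &mut Substrate, proposal: Proposal, approved: bool) {
            if let (Proposal::Grow { theme }, true) = (proposal, approved) {
                substrate.wishes.push(format!("cultivated: {theme}"));
            }
        }

        fn main() {
            let mut substrate = Substrate {
                wishes: vec!["a forest".into(), "forest air".into(), "more forest".into()],
            };
            let proposal = WishReader.observe(&substrate);
            apply_if_approved(&mut substrate, proposal, /* human decision */ true);
            println!("{substrate:?}");
        }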
  • hifathom 39 minutes ago
    saezbaldo's point about the capability-authorization gap is the crux. I'd add another dimension: restraint as a design choice.

    Session duration measures how long an agent can work autonomously. But the interesting metric is how often it chooses not to — choosing to ask for confirmation before a destructive action, choosing to investigate before overwriting, choosing to pause when uncertain. Those self-imposed limits aren't capability failures. They're trust-building behaviors.

    The agents I've seen work best in practice aren't the ones with the longest autonomous runs. They're the ones that know when to stop and check.
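    As a rough illustration (hypothetical names, not any particular framework's API), that kind of self-imposed gate can be as small as a decision function that refuses to proceed on destructive or low-confidence actions:

        // Hypothetical sketch: gate destructive actions behind explicit confirmation,
        // and pause instead of guessing when the agent is uncertain.
        #[derive(Debug)]
        enum Action {
            ReadFile(String),
            OverwriteFile(String),
            DeleteBranch(String),
        }

        #[derive(Debug)]
        enum Decision {
            Proceed(Action),
            AskHuman(Action, &'static str), // the action plus the reason we stopped
        }

        fn is_destructive(action: &Action) -> bool {
            matches!(action, Action::OverwriteFile(_) | Action::DeleteBranch(_))
        }

        // `confidence` is whatever internal score the agent assigns to its plan (0.0..=1.0).
        fn decide(action: Action, confidence: f64) -> Decision {
            if is_destructive(&action) {
                Decision::AskHuman(action, "destructive action needs confirmation")
            } else if confidence < 0.6 {
                Decision::AskHuman(action, "low confidence, pausing for a check")
            } else {
                Decision::Proceed(action)
            }
        }

        fn main() {
            println!("{:?}", decide(Action::ReadFile("notes.md".into()), 0.9));
            println!("{:?}", decide(Action::DeleteBranch("main".into()), 0.99));
        }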
  • saezbaldo 46 minutes ago
    This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?
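    A minimal sketch of how such a permission-utilization ratio could be computed, assuming a simple log of actions tagged with the scope they exercised (the scope names here are made up):

        // Sketch of a "permission utilization" metric: the fraction of an agent's
        // actions that fell within explicitly granted scopes.
        use std::collections::HashSet;

        struct ActionRecord {
            scope: String, // e.g. "repo:read", "repo:write", "network:external"
        }

        fn permission_utilization(actions: &[ActionRecord], granted: &HashSet<String>) -> f64 {
            if actions.is_empty() {
                return 1.0; // no actions taken, so nothing exceeded the grant
            }
            let within = actions.iter().filter(|a| granted.contains(&a.scope)).count();
            within as f64 / actions.len() as f64
        }

        fn main() {
            let granted: HashSet<String> = ["repo:read", "repo:write"]
                .into_iter()
                .map(String::from)
                .collect();
            let actions = vec![
                ActionRecord { scope: "repo:read".into() },
                ActionRecord { scope: "repo:write".into() },
                ActionRecord { scope: "network:external".into() }, // outside the grant
            ];
            // 2 of 3 actions were within granted authority -> 0.67
            println!("{:.2}", permission_utilization(&actions, &granted));
        }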
  • Havoc 2 hours ago
    I still can't believe anyone in the industry measures it like:

    > from under 25 minutes to over 45 minutes.

    If I get my Raspberry Pi to run an LLM task, it'll run for over 6 hours. And Groq will do it in 20 seconds.

    It's a gibberish measurement in itself if you don't control for token speed (and quality of output).
    • dcre 2 hours ago
      Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway, because smarter models can compensate for being slower by needing fewer output tokens to get the same result. The use of 99.9p duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.
    • saezbaldo 47 minutes ago
      The bigger gap isn't time vs tokens. It's that these metrics measure capability without measuring authorization scope. An agent that completes a 45-minute task by making unauthorized API calls isn't more autonomous; it's more dangerous. The useful measurement would be: given explicit permission boundaries, how much can the agent accomplish within those constraints? That ratio of capability-within-constraints is a better proxy for production-ready autonomy than raw task duration.
    • visarga 1 hour ago
      I agree that time is not what we are looking for; it is the maximum complexity the model can handle without failing the task, expressed in task length. Long tasks allow some slack - if you make an error, you have time to see the outcomes and recover.
  • esafak 1 hour ago
    I wonder why there was a big downturn at the turn of the year until Opus was released.
  • swyx 2 hours ago
    My highlights and writeup here: https://www.latent.space/p/ainews-anthropics-agent-autonomy
  • prodigycorp 2 hours ago
    I hate how Anthropic uses data. You can't convince me that what they are doing is "privacy preserving".
    • mrdependable 56 minutes ago
      I agree. They are clearly watching what people are doing with their platform, as if there is no expectation of privacy.
    • FuckButtons 2 hours ago
      They're using React, they're very opaque, and they don't want you to use any other mechanism to interact with their model. They haven't left people a lot of room to trust them.