21 comments

  • bohdanstefaniuk3 hours ago
    I would like to see token savings not only in regular chats but also in coding sessions. It would also be nice to have some kind of revenue attribution. For example, we have a project that earns X amount of money, and I would like to see whether our token usage actually helps move our revenue, or at least what share of our revenue goes to token costs.
  • lavaman1315 days ago
    This is an interesting approach. I think people still underestimate how quickly token limits and context bloat become the bottleneck when you start running agents in production loops. Seems like a very relevant topic, especially given the current environment. Will check it out.
    • akh4 days ago
      Thanks - indeed, it's also difficult to estimate how much the agents will cost in advance, and how much changes could reduce or increase costs.
  • yohji198417 hours ago
    I'm wondering why all these token-saving solutions focus their benchmarks exclusively on simple Q&A tasks. If their tools truly saved money in real, long-term programming tasks, they would have definitely published those benchmark results instead of just Q&A tests, especially since a simple code editing benchmark with a hidden eval harness is very easy to design. Personally, asking a coding agent questions without any code editing is a very rare case for me
    • glenngillen11 hours ago
      I did exactly that and it's all covered in the blog post. There's no hidden eval harness, it's in the same codebase as the CLI so others can reproduce and/or extend as they see fit. It also includes code editing tasks and measures them too. The only asterisk on the code editing is I didn't automate the reporting of accuracy because the test only uses Claude and having it judge its own work seemed dubious, and having our existing parsers + policy checks verify Claude's output in a benchmark test like this might look like we were cooking the books in our favor (i.e., we're testing and verifying using our own system, which obviously we will always get 100% on). Writing up a whole new independent Terraform parser or test harness to verify the results was beyond the scope of what I was willing to do for this right now. So I opted for a "just assume Claude always gets it right", and we reported on just the token differences to get there.
      • yohji19848 hours ago
        Sorry, I missed the Open Items section. You're right about that, designing a good eval harness can be difficult and expensive. Maybe we need some kind of community project for agentic evals, where people can share eval harnesses and run logs.
  • photonair4 days ago
    Tokens can get expended very quickly, driving costs up substantially, since there is no clarity into their usage. If your service can cut usage substantially, then for a large company with a huge AI bill the savings could make up for the cost of the service, which I think is pricey. But I can see how it could work out.
    • akh4 days ago
      yep, the ROI for cloud and AI spend tools is fairly easy to measure: it has to save multiples of what it costs.
  • 57016524005 days ago
    I don't know how they can justify a 250 USD / month bill, let alone 1000 USD / month.
    • akh5 days ago
      We prevent way more than that from being added to the cloud bill by showing engineers cost estimates that enable them to make better decisions pre-deploy. E.g. when an engineer knows the IOPS option on their EC2 instance is costing them a lot, they're more likely to reduce it or not use it in dev envs, vs just copy/pasting what's on production. There's an ROI report on infracost.io that shows how we measure the cost prevention between the first and last commit on merged PRs.
    • hx85 days ago
      Seems to be targeted at quickly reducing infra cost for small human teams with high compute costs. I can see some value, but it's something I'd want to review quarterly instead of per-commit. I might feel different if I was really trying to stretch some runway.
      I can see why YC is interested in this issue, as I'm sure lots of startups are trying to stretch that runway.
      • hkh5 days ago
        When we started, we thought everyone could use it, from startups to medium-sized companies. What we learnt is that the most value comes for enterprises. The reason is they have used Terraform to decentralize infra provisioning, so now instead of a central platform team making all the IaC changes, you have hundreds or thousands of engineers making changes every month.
        Each of them is making a lot of decisions on the infra, and combining that with the crazy pricing models from the cloud providers, we found we were saving companies a lot of money.
        Then we saw how much time is saved when you catch it at this point vs after the fact. Basically avoiding a bunch of tech debt.
        • hx85 days ago
          If you're primarily serving enterprise then the $250/mo foot-in-the-door price makes sense. No reason to make too aggressive of a play for the small market/mid market.
  • tuo-lei20 hours ago
    how are you handling errors? when an agent gets a flag wrong, cli help text is usually massive. could eat a lot of the savings on retries.
    • glenngillen10 hours ago
      It's been a lot of trial & error. A quick aside: running these tests/evals/call them what you will at scale has been fascinating to me. Going back and trawling through the logs has been like speed-running through hundreds of usability tests with people, full of the same types of "aha! Of course you'd try and do that, why didn't I think of that already?" moments of insight and inspiration.
      Which is also how we've gone about working out how to improve the CLI. It's usually one or more of:
      * rethinking the subcommands and hierarchy to something more obvious and aligned to the task
      * providing clear documentation upfront (i.e., in the skills file)
      * keeping help text concise, but not too concise. You can't assume the reader is already a power user who is simply looking for a reminder/reference. So include usage examples for common use cases
      * where possible on errors, suggesting the likely commands the person meant
      * in general, offering affordances on what the likely next steps will be. This goes for help output, success, and errors.
      > cli help text is usually massive
      That doesn't have to be true.
      > could eat a lot of the savings on retries
      This doesn't have to be true either. You don't need to give the same full help output on every single error; once they've got it, they've got it. Also, the size of the entire help output for most CLIs is generally insignificant compared to even just a couple of source files in most repos.
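A minimal sketch of two of the ideas in the reply above: suggest the likely command on an error, and print the full help text only once per session. The subcommand names, help strings, and the `Cli` class are hypothetical, invented for illustration; this is not the actual Infracost CLI. The fuzzy matching uses Python's standard `difflib`.

```python
import difflib

# Hypothetical subcommands and help lines, for illustration only.
COMMANDS = {
    "estimate": "estimate <path>   Estimate monthly cost for the IaC in <path>",
    "diff": "diff <path>       Show the cost change against the base branch",
    "policy": "policy <path>     Check the IaC against cost policies",
}

FULL_HELP = "Usage: tool <command> [args]\n" + "\n".join(COMMANDS.values())


class Cli:
    def __init__(self):
        # Tracks whether this session has already received the full help text.
        self.help_shown = False

    def run(self, argv):
        if not argv:
            return self._help()
        name = argv[0]
        if name in COMMANDS:
            return f"ok: {name}"
        # Suggest the closest known commands instead of dumping everything.
        guesses = difflib.get_close_matches(name, list(COMMANDS), n=2, cutoff=0.6)
        lines = [f"error: unknown command '{name}'"]
        if guesses:
            lines.append("did you mean: " + ", ".join(guesses) + "?")
        if self.help_shown:
            # Full help was already sent once; only point back to it.
            lines.append("run 'tool help' to see all commands again")
        else:
            lines.append(self._help())
        return "\n".join(lines)

    def _help(self):
        self.help_shown = True
        return FULL_HELP
```

With this shape, a first typo costs one full help listing plus a suggestion, and every later typo costs only two or three short lines, which is the retry behaviour the reply argues for.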
  • 57016524005 days ago
    why would anyone need 10,000 runs a month? do people modify their infrastructure 10,000 times a month?
    • akh5 days ago
      CI/CD pipelines need 1 CLI run per commit (like any other code scanning tool); we regularly see enterprises with 100K+ runs/month.
    • esafak5 days ago
      Ephemeral environments
  • AMILLI_AI_CORP13 hours ago
    AMilliPay.com solves the Ai agent economy.
  • zuzululu5 days ago
    Not really seeing the point. I just use OpenRouter if I'm penny pinching.
    • akh5 days ago
      OpenRouter is great for keeping your LLM API bill down; Infracost is about the AWS/Azure/GCP bill your IaC creates. When an agent writes IaC that creates a NAT gateway or an RDS instance, that's $50-5000/mo in cloud spend, so if the agent knows that estimate and the best practices as it's generating the code, it can optimize the code pre-deploy.
  • shakeelhussain51 hour ago
    [flagged]
  • smy-hn3 hours ago
    [dead]
  • winphoto5 days ago
    [flagged]
  • songting5915 days ago
    [flagged]
  • sid_ships5 days ago
    [flagged]
  • sanreds5 days ago
    [dead]
  • zane_shu5 days ago
    The useful split here seems to be: let the CLI do price lookup and validation, and let the agent decide which diff to make. The thing I’d watch is how visible the source of the estimate is in review: if a PR says “saved $X”, reviewers need to see which prices/rules produced that number.
    • glenngillen5 days ago
      Exactly! The estimates + cost diffs are expandable in the PR so you can see the working.
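One way to picture the point in this exchange is a cost diff where the headline number is derived from line items that each name their own price or rule source. The sketch below is an assumption about the shape of such a report, not Infracost's actual PR output; the `LineItem` type, field names, and the figures used when calling it are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    resource: str
    before: Decimal  # monthly cost before the change
    after: Decimal   # monthly cost after the change
    source: str      # which price or rule produced these numbers


def render_cost_diff(items):
    """Render a headline total followed by the line items that explain it."""
    total = sum((i.after - i.before for i in items), Decimal("0"))
    verb = "saved" if total < 0 else "added"
    lines = [f"{verb} ${abs(total)}/mo"]
    for i in items:
        # Every line carries its source so a reviewer can trace the headline.
        lines.append(f"  {i.resource}: ${i.before} -> ${i.after} [{i.source}]")
    return "\n".join(lines)
```

Because the headline is computed from the same items that are printed beneath it, the "saved $X" claim cannot drift away from the prices and rules shown to the reviewer.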
  • jing099285 days ago
    The interesting bit is making cloud cost a first-class constraint for the agent loop, not just a post-hoc report. I'd be curious how you handle confidence/uncertainty in estimates, since a wrong cheap-looking recommendation can be worse than no estimate in infra PRs.
    • glenngillen5 days ago
      We've a lot of experience doing this! Also, while this feeds into and supports LLMs and non-deterministic systems, our recommendations are entirely deterministic. So it's pretty rare to have a "wrong" recommendation, given they've essentially been implemented + reviewed by actual people.
      What can definitely happen though is you get one that is inappropriate in a given context. An example here might be a recommendation from an m5.2xlarge to an m6g.2xlarge instance. Same vCPUs and memory, lower cost, but... also a switch from Intel -> ARM architectures. For a lot of companies their build pipelines make it easy enough to make that change. For others there may be some specific dependency on Intel for that workload, which means changing the architecture isn't viable. In that case you can simply dismiss the recommendation and we'll stop suggesting it.
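A small sketch of what "deterministic, with dismissals" could look like: a fixed rule table, a pure function over it, and a set of dismissed suggestions that are filtered out. Only the m5.2xlarge to m6g.2xlarge pair comes from the reply above; the rule table layout, the `recommend` function, and the dismissal key are assumptions for illustration.

```python
# Hypothetical rule table. The instance pair is the one named in the thread;
# the structure and field names are illustrative assumptions.
RULES = {
    "m5.2xlarge": {
        "suggest": "m6g.2xlarge",
        "caveat": "switches Intel (x86_64) to ARM (arm64)",
    },
}


def recommend(resources, dismissed=frozenset()):
    """Return the same recommendations for the same input, every time.

    resources: mapping of resource name -> instance type
    dismissed: set of (resource name, suggested type) pairs to stop suggesting
    """
    out = []
    # Sorting keeps the output order stable across runs.
    for name, instance_type in sorted(resources.items()):
        rule = RULES.get(instance_type)
        if rule is None or (name, rule["suggest"]) in dismissed:
            continue
        out.append((name, instance_type, rule["suggest"], rule["caveat"]))
    return out
```

Surfacing the caveat alongside the suggestion is what lets a team with an Intel-only dependency recognise the recommendation as inappropriate for their context and dismiss it once.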
  • arxiv1014 days ago
    [flagged]
  • ismrhao5 days ago
    [flagged]
  • eugeneonai6 days ago
    The 79% / 67% reduction generalizes more broadly than IaC. Any CLI that agents shell out to (curl, jq, grep, kubectl, gh, psql) burns the same token tax: verbose JSON, free-form text output, agent-composed pipelines. A predicate-flag + compact-output redesign would land on all of those.
    Direct answer to your question: agents-writing-IaC-in-prod is rare today but not zero. I see more "agent reviews the IaC PR a human wrote", which Cost.dev sounds well-suited to, since verification runs locally and the agent only consumes the result. Even if the prod-IaC path takes another year, the design pattern earns its keep on every agent-shellout you already do. One question: does the CLI surface its cache state to the agent, or does each invocation start fresh? Repeated price-fetches across a single agent run would be the obvious next-tier savings.
    • glenngillen5 days ago
      We do cache the results locally so that we're not repeatedly hitting our pricing API. The LLM doesn't access that cache directly though, as it'd suffer the token tax you mention. Instead we optimised our CLI to return agent-optimised results. We're constantly iterating and improving on it, but it already reduces token usage very significantly. I wrote about it here: https://www.infracost.io/resources/blog/we-cut-claude-s-token-usage-79-by-redesigning-our-cli-for-agents
      We've found even more improvements since that post, so those will be shipping soon too.
      • eugeneonai5 days ago
        Great, will it be possible to see it in your profile?
        • glenngillen5 days ago
          I'm not sure I follow, which profile do you mean? My profile on HN?
          I don't know if we'll keep dissecting every incremental improvement we make, as (so far) the general approach is the same as documented in the existing blog post: document common use cases -> benchmark them -> identify bottlenecks/expensive hot spots -> fix them -> repeat
          The main thing changing right now is observing new, more frequent use cases (either because we're adding new capabilities, or users are doing things we didn't entirely predict) and adding them to the test cases.
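The loop described in this sub-thread (document use cases, benchmark them, find the expensive hot spot, fix, repeat) can be sketched roughly as below. Everything here is an assumption for illustration: the four-characters-per-token estimate is a crude proxy rather than a real tokenizer, and the two renderers are hypothetical stand-ins for a verbose JSON output and an agent-oriented compact output.

```python
import json


def approx_tokens(text):
    # Crude proxy of about 4 characters per token. An assumption, not a tokenizer.
    return max(1, len(text) // 4)


def verbose_output(rows):
    # Stand-in for a verbose, pretty-printed JSON response.
    return json.dumps({"resources": rows}, indent=2)


def compact_output(rows):
    # Stand-in for a compact, agent-oriented response: one short line per resource.
    return "\n".join(f"{r['name']} {r['monthly_cost']}" for r in rows)


def benchmark(use_cases, render):
    """Map each use case name to the approximate tokens its output costs."""
    return {name: approx_tokens(render(rows)) for name, rows in use_cases.items()}


def hot_spot(results):
    """Return the use case that costs the most tokens, i.e. what to fix next."""
    return max(results, key=results.get)
```

Running `benchmark` with both renderers over the same use cases gives a before/after pair per case, and `hot_spot` picks the next target; new use cases observed in the wild are added to the `use_cases` mapping and the loop repeats.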
  • dezsirazvan5 days ago
    [flagged]