5 comments

  • evantahler 1 hour ago
    I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
    • john_strinlai 1 hour ago
      "we investigated ourselves and found nothing wrong"
  • Retr0id 37 minutes ago
    What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).
    • jldugger 2 minutes ago
      IDK how it applies to LLMs, but the original meaning was a change in a distribution over time. Say you had some model-based app trained on American English, and slowly more and more American Spanish users adopt your app: the training set distribution is drifting away from the actual usage distribution.
      In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
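      A minimal sketch of that idea, assuming drift is measured as the total variation distance between the language mix in the training set and the language mix seen in production. The language shares, the month labels, and the helper names `distribution` and `total_variation` are all made up for illustration; real drift monitoring would typically use richer features than a single label.

```python
from collections import Counter


def distribution(labels):
    """Return the relative frequency of each label in a list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions.

    0.0 means the distributions are identical, 1.0 means they share no mass.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical data: the training set is mostly English ("en").
training = ["en"] * 95 + ["es"] * 5

# Hypothetical production traffic: Spanish ("es") usage grows over time.
usage_by_month = {
    "month_1": ["en"] * 90 + ["es"] * 10,
    "month_6": ["en"] * 70 + ["es"] * 30,
    "month_12": ["en"] * 50 + ["es"] * 50,
}

train_dist = distribution(training)
drift = {
    month: total_variation(train_dist, distribution(labels))
    for month, labels in usage_by_month.items()
}

for month, value in drift.items():
    print(f"{month}: drift = {value:.2f}")
```

      The holdout set is drawn from the same mix as `training`, so its distance from the training distribution stays near zero even while the production numbers climb, which is why holdout accuracy can keep looking fine.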
    • idle_zealot 34 minutes ago
      I believe it's businessspeak for "change." Gap is suittongue for "difference."
  • aleksiy123 1 hour ago
    Interesting approach. I've been particularly interested in tracking and understanding whether adding skills or tweaking prompts is making things better or worse.
    Anyone know of any other similar tools that allow you to track across harnesses while coding?
    Running evals as a solo dev is too cost-prohibitive, I think.
    • FrankRay78 17 minutes ago
      See the very last section in this doc for how I minimise token usage and track savings; all three plugins co-exist fine: https://github.com/FrankRay78/NetPace/blob/main/docs/agentic-software-development-workflow.md
  • wongarsu 1 hour ago
    See also https://marginlab.ai/trackers/claude-code-historical-performance/ for a more conventional approach to tracking regressions.
    This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.