11 comments

  • jldugger2 hours ago
    Every six months or so, someone at work does a hackathon project to automate outage analysis work SRE would likely perform. And every one of them I've seen has been underwhelming and wrong.

    There are roughly three reasons for this disconnect.

    1. The agents aren't expert at your proprietary code. They can read logs and traces and make educated guesses, but there's no world model of your code in there.

    2. The people building these apps are unqualified to review the output. I used to mock narcissists evaluating ChatGPT quality by asking it for their own biography, but they're at least using a domain they are an expert in. Your average MLE has no profound truths about Kubernetes or the app. At best, they're using some toy "known broken" app to demonstrate under what are basically ideal conditions, but part of the holdout set should be new outages in your app.

    3. SREs themselves are not so great at causal analysis. Many junior SREs take the "it worked last time" approach, but this embeds a presumption that whatever went wrong "last time" hasn't been fixed in code. Your typical senior SRE takes a "what changed?" approach, which is depressingly effective (as it indicates most outages are caused by coworkers). At the highest echelons, I've seen research papers examining metastability and Granger causality networks, but I'm pretty sure nobody in SRE or these RCA agents can explain what they mean.

    > The key insight: individual session failures look random. But when you cluster the hypotheses, failure patterns emerge.

    My own insight is mostly Bayesian. Typical applications have redundancy of some kind, and you can extract useful signals by separating "good" from "bad". A simple Bayesian score of (100+bad)/(100+good) does a relatively good job of suppressing the "oh, that error log always happens" signals. There's also likely a path using ClickHouse-level data and Bayesian causal networks, but the problem is that traditional Bayesian networks are hand-crafted by humans.

    So yes, you can ask an LLM for 100 guesses and do some kind of k-means clustering on them, but you can probably do a better job doing dimensional analysis first and passing that on to the agent.
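    [Editor's note] The (100+bad)/(100+good) scoring above can be sketched in a few lines. This is only an illustration of the idea — the function name and the toy log patterns are made up, not from any real system:

    ```python
    from collections import Counter

    def bayesian_error_score(bad_logs, good_logs, prior=100):
        """Score each log pattern by (prior + bad_count) / (prior + good_count).

        Patterns equally common in healthy and failing sessions score near 1.0;
        patterns concentrated in failing sessions score well above 1.0. The
        `prior` term smooths rare, one-off messages toward 1.0, which is what
        filters out the "oh, that error log always happens" noise.
        """
        bad, good = Counter(bad_logs), Counter(good_logs)
        return {p: (prior + bad[p]) / (prior + good[p])
                for p in set(bad) | set(good)}

    # Toy data: "cache miss" shows up in nearly every session (good or bad),
    # while "upstream timeout" is concentrated in the failing ones.
    good = ["cache miss"] * 500 + ["upstream timeout"] * 5
    bad = ["cache miss"] * 480 + ["upstream timeout"] * 300
    scores = bayesian_error_score(bad, good)
    # "upstream timeout" scores ~3.8; "cache miss" stays near 1.0
    ```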
    • almogbaku1 hour ago
      Hi @jldugger,

      Great points, but I think there's a domain confusion here. You're describing infra/code RCA. Kelet does AI agent quality RCA — the agent returns a 200 OK, but gives the wrong answer.

      The signal space is different. We're working with structured LLM traces + explicit quality signals (thumbs down, edits, eval scores), not distributed-system logs. Much more tractable.

      Your Bayesian point actually resonates — separating good from bad sessions and looking for structural differences is close to what we do. But the hypotheses aren't "100 LLM guesses + k-means." Each one is grounded in actual session data: what the user asked, what the agent did, what came back, and what the signal was.

      Curious about the dimensional analysis point — are you thinking about reducing the feature space before hypothesis generation?
  • yanovskishai3 hours ago
    I imagine it's hard to create a very generic tool for this use case - what are the supported frameworks/libs, and what does this tool assume about my implementation?
  • RoiTabach2 hours ago
    This looks amazing! Do you have a LiteLLM integration?
    • almogbaku1 hour ago
      Hi @RoiTabach, OP here.

      Yep. We can integrate with every solution that supports OpenTelemetry :) so it's pretty native — just use the integration skill:

      npx skills add kelet-ai/skills
  • dwb2 hours ago
    > The key insight

    I'm so tired
    • almogbaku1 hour ago
      Hey @dwb, OP here.

      Yes, I definitely had an LLM assist in writing it. Yeah — I should have stripped it better.

      Yet it's f*ing painful to do error analysis and go through thousands of traces. Hope you can live with my human mistakes.
    • whythismatters1 hour ago
      Sadly, they forgot to mention why this matters
    • system161 hour ago
      Also the obligatory “It’s not A. It’s B.”
    • hmokiguess2 hours ago
      Hahahahahahahahahhaa ngl, your comment killed me, some LLM tells are so funny
  • BlueHotDog23 hours ago
    Nice, what a crazy space. How is this different from other telemetry/analysis platforms such as LangChain/Braintrust etc.?
    • almogbaku1 hour ago
      Hi @BlueHotDog2, OP here.

      LangSmith/Langfuse/Braintrust collect traces, and then YOU need to look at them and analyze them (error analysis/RCA).

      Kelet does that for you :)

      Does that make sense? If not, please tell me — I'm still trying to figure out how to explain it, lol.
  • halflife2 hours ago
    Kelet as in קלט as in input?
    • almogbaku1 hour ago
      Hi @halflife, OP here.

      Yep, good catch! "Kelet" is input/prompt in Hebrew :)
  • hadifrt202 hours ago
    In the quickstart, the suggested fixes are called "Prompt Patches" .. does that mean Kelet only surfaces root causes that are fixable in the prompt? What happens when the real bug is in tool selection or retrieval ranking, for example?
    • almogbaku1 hour ago
      Hi @hadifrt20, OP here.

      From what we discovered analyzing ~33K+ sessions, most of the time when the agent selects the wrong tool, it's because the tool's description (i.e., its prompt) was not good enough, or there's a missing nuance.

      That falls exactly within Kelet's scope :)
  • peter_parker2 hours ago
    > They just quietly give wrong answers.

    It's not only about wrong answers. They also just get stuck in a loop sometimes.