17 comments

  • rs5458378 hours ago
    Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.<p>We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90&#x2F;10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70&#x2F;30 it&#x27;s 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.<p>One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free: zero test runs needed.<p>Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)&#x2F;D_KL to (log2(n) - I_prior)&#x2F;D_KL. On 1024-commit repos with 80&#x2F;20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.<p>We&#x27;re building this into sem (<a href="https:&#x2F;&#x2F;github.com&#x2F;ataraxy-labs&#x2F;sem" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;ataraxy-labs&#x2F;sem</a>), which has an entity dependency graph that provides the structural signal.
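    A minimal sketch of what that structural prior could look like (the dependent counts and the function name are illustrative, not an actual sem API):

```python
def graph_prior(dependents_per_commit, smoothing=1.0):
    """Normalized prior over commits, weighted by how many transitive
    dependents the code each commit touches has. Smoothing keeps commits
    that only touch isolated code from getting exactly zero mass."""
    weights = [d + smoothing for d in dependents_per_commit]
    total = sum(weights)
    return [w / total for w in weights]

prior = graph_prior([10, 1, 50, 3])
assert abs(sum(prior) - 1.0) < 1e-9
assert max(prior) == prior[2]  # most-connected commit gets the most mass
```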
    • sfink8 hours ago
      &gt; We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90&#x2F;10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70&#x2F;30 it&#x27;s 9% vs 67%.<p>I don&#x27;t understand what you&#x27;re comparing. Can&#x27;t you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don&#x27;t understand this after all.
      • rs5458378 hours ago
        Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget (300 test runs). Sorry, I should have clarified that.
        • hauntsaninja3 hours ago
          Yep, you can run bayesect to an arbitrary confidence level.<p>This script in the repo <a href="https:&#x2F;&#x2F;github.com&#x2F;hauntsaninja&#x2F;git_bayesect&#x2F;blob&#x2F;main&#x2F;scripts&#x2F;calibration.py" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;hauntsaninja&#x2F;git_bayesect&#x2F;blob&#x2F;main&#x2F;scrip...</a> will show you that a) the confidence level is calibrated, b) how quickly you get to that confidence level (on average, p50 and p95)<p>For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests
          • rs5458371 hour ago
            You&#x27;re right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.<p>Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:<p>128 commits, 500 trials:<p>Budget=50, 70&#x2F;30: uniform 22% → graph 33%<p>Budget=50, 80&#x2F;20: uniform 71% → graph 77%<p>Budget=100, 70&#x2F;30: uniform 56% → graph 65%<p>At 300 tests the gap disappears since there&#x27;s enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.<p>Script: <a href="https:&#x2F;&#x2F;gist.github.com&#x2F;rs545837&#x2F;b3266ecf22e12726f0d55c5646617c43" rel="nofollow">https:&#x2F;&#x2F;gist.github.com&#x2F;rs545837&#x2F;b3266ecf22e12726f0d55c56466...</a>
  • hauntsaninja4 days ago
    git bisect works great for tracking down regressions, but relies on the bug presenting deterministically. But what if the bug is non-deterministic? Or worse, your behaviour was always non-deterministic, but something has changed, e.g. your tests went from somewhat flaky to very flaky.<p>In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: <a href="https:&#x2F;&#x2F;hauntsaninja.github.io&#x2F;git_bayesect.html" rel="nofollow">https:&#x2F;&#x2F;hauntsaninja.github.io&#x2F;git_bayesect.html</a>
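    The heart of the math is compact enough to sketch. With a Beta(1, 1) prior on the unknown failure rates before and after the change, each segment has a closed-form marginal likelihood, so the posterior over change points is a single normalized sweep (a simplified illustration, not the actual git_bayesect code):

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(fails, passes, a=1.0, b=1.0):
    # Marginal likelihood of a segment's pass/fail counts with the
    # failure rate integrated out under a Beta(a, b) prior.
    return log_beta(a + fails, b + passes) - log_beta(a, b)

def changepoint_posterior(obs):
    # obs: list of (fails, passes) per commit, oldest first.
    # Hypothesis k: commits before index k share one failure rate,
    # commits from index k onward share another.
    n = len(obs)
    logs = []
    for k in range(1, n):
        f0 = sum(f for f, _ in obs[:k]); p0 = sum(p for _, p in obs[:k])
        f1 = sum(f for f, _ in obs[k:]); p1 = sum(p for _, p in obs[k:])
        logs.append(log_marginal(f0, p0) + log_marginal(f1, p1))
    m = max(logs)
    ws = [exp(l - m) for l in logs]
    z = sum(ws)
    return [w / z for w in ws]  # entry k-1 is P(behavior changed at commit k)

post = changepoint_posterior([(0, 10), (0, 10), (5, 5), (6, 4)])
assert post.index(max(post)) == 1  # change most likely at commit index 2
```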
    • ajb9 hours ago
      Nice! I implemented a similar thing a while back: <a href="https:&#x2F;&#x2F;github.com&#x2F;Ealdwulf&#x2F;BBChop" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;Ealdwulf&#x2F;BBChop</a><p>I&#x27;m going to have to check out how you got linear time with Shannon entropy, because I used Renyi entropy to do that, to make the algebra easier.<p>It&#x27;s also possible to do it over the DAG, rather than a linear history - although that makes the code a lot more complicated. Unfortunately there doesn&#x27;t seem to be a linear time cumulative sum algorithm over dags, so it&#x27;s super linear in some cases.
    • belden4 hours ago
      My team doesn’t always have cleanly bisectable branches being merged to main -- it’s not uncommon to see “fix syntax error” types of commits.<p>But, to merge we need to have all tests pass. (If tests flakily pass then we get new flakey tests, yay!)<p>I know git-bisect doesn’t support this: but could git-bayesect have an option to only consider merge commits? Being able to track a flake change back to an individual PR would be really useful.
      • sluongng3 hours ago
        You can run bisect with first-parent
        • hauntsaninja3 hours ago
          That sounds right. `git_bayesect` currently uses `--first-parent`, so I think belden&#x27;s use case should work, but I haven&#x27;t tested it much on complicated git histories.
    • Myrmornis11 hours ago
      This is really cool! Is there an alternative way of thinking about it involving a hidden markov model, looking for a change in value of an unknown latent P(fail)? Or does your approach end up being similar to whatever the appropriate Bayesian approach to the HMM would be?
      • tazsat05121 hour ago
        The HMM framing connects to change-point detection. CUSUM (Cumulative Sum) charts solve a related problem: detecting when a process parameter has shifted by accumulating deviations from an expected value.<p>Key difference: CUSUM assumes sequential observation and asks &quot;when did the distribution shift?&quot; Bayesect asks &quot;which commit should I test next?&quot; — active learning vs passive monitoring.<p>But they could complement each other. If you already have CI pass&#x2F;fail history, CUSUM on that data gives you a rough change-point estimate for free (no extra test runs), then bayesect refines it with active sampling.
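        A toy version of that free change-point estimate from existing CI history (the one-sided statistic and the thresholds are illustrative):

```python
def cusum_changepoint(failures, baseline, slack=0.05):
    """One-sided CUSUM over a 0/1 failure history (oldest first).
    Accumulates deviations above baseline + slack; resets to zero while
    the process looks in-control. Returns the start index of the
    excursion that reached the highest peak -- a rough change-point
    estimate to seed active sampling."""
    s, best_s = 0.0, 0.0
    start, best_start = 0, 0
    for i, x in enumerate(failures):
        s += x - (baseline + slack)
        if s <= 0:
            s, start = 0.0, i + 1
        elif s > best_s:
            best_s, best_start = s, start
    return best_start

# 50 healthy runs, then the failure rate jumps to ~60%:
history = [0] * 50 + [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
assert cusum_changepoint(history, baseline=0.05) == 50
```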
  • OfirMarom19 minutes ago
    What I love about this is you applied Bayesian methods to a problem I never would have thought to use it on!
  • supermdguy4 days ago
    Okay this is really fun and mathematically satisfying. Could even be useful for tough bugs that are technically deterministic, but you might not have precise reproduction steps.<p>Does it support running a test multiple times to get a probability for a single commit instead of just pass&#x2F;fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
    • hauntsaninja4 days ago
      Yay, I had fun with it too!<p>IIUC the way you&#x27;d do that right now is just repeatedly recording the individual observations on a single commit, which effectively gives it a probability + the number of trials to do the Beta update. I don&#x27;t yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one<p>But ofc part of the magic is that git_bayesect&#x27;s commit selection tells you how to be maximally sample efficient, so you&#x27;d only want to do a batch record if your test has high constant overhead
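      The batch update is one line of conjugate arithmetic; a hedged sketch of the entrypoint described above, where record_batch is a hypothetical name rather than an existing CLI command:

```python
def record_batch(alpha, beta, fail_prob, num_trials):
    """Fold num_trials Bernoulli trials, summarized as an observed
    failure probability, into a Beta(alpha, beta) posterior in one step."""
    fails = fail_prob * num_trials
    return alpha + fails, beta + (num_trials - fails)

# 10 runs with a 30% observed failure rate, starting from a uniform Beta(1, 1):
a, b = record_batch(1.0, 1.0, 0.3, 10)
assert abs(a - 4.0) < 1e-9 and abs(b - 8.0) < 1e-9
assert abs(a / (a + b) - 1 / 3) < 1e-9  # posterior mean failure rate
```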
      • __s10 hours ago
        recompiling can be high constant overhead
        • ajb9 hours ago
          In theory, the algorithm could deal with that by choosing the commit at each step, which gives the best expected information gain; divided by expected test time. In most cases it would be more efficient just to cache the compiled output though.
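          That selection rule can be sketched directly, with the expected gains and timings as assumed inputs:

```python
def pick_commit(expected_gain_bits, test_time, rebuild_time, cached):
    """Index of the commit maximizing expected bits of information per
    second, charging rebuild_time only for commits whose build is not
    already cached."""
    def rate(i):
        t = test_time[i] + (0.0 if cached[i] else rebuild_time)
        return expected_gain_bits[i] / t
    return max(range(len(expected_gain_bits)), key=rate)

# A cached commit with slightly lower expected gain beats an uncached one
# hiding behind a 15-minute rebuild:
idx = pick_commit([1.0, 0.8], test_time=[1.0, 1.0], rebuild_time=900.0,
                  cached=[False, True])
assert idx == 1
```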
          • sfink9 hours ago
            This doesn&#x27;t sound quite right, but I&#x27;m not sure why.<p>Perhaps: a reasonable objective would be to say that for N bits of information, I would like to pick the test schedule that requires the least total elapsed time. If you have two candidate commits and a slow recompile time, it seems like your algorithm would do many repeats of commit A until the gain in information per run drops below the expected gain from B divided by the recompile time, then it would do many repeats of B, then go back to A, etc. So there are long runs, but you&#x27;re still switching back and forth. You would get the same number of bits by doing the same number of test runs for each commit, but batching all of the A runs before all of the B runs.<p>Then again: you wouldn&#x27;t know how many times to run each in advance, and &quot;run A an infinite number of times, then run B an infinite number of times&quot; is clearly not a winning strategy. Even with a fixed N, I don&#x27;t think you could figure it out without knowing the results of the runs in advance. So perhaps your algorithm <i>is</i> optimal?<p>It still feels off. You&#x27;re normalizing everything to bits&#x2F;sec and choosing the maximum. But comparing an initial test run divided by the rebuild time vs a subsequent test run divided by a much faster time seems like you&#x27;re pretending a discrete thing is continuous.<p>I wish I could math good.
            • ajb8 hours ago
              The general requirement for this approach to be optimal, is called &quot;dynamical consistency&quot;. A good description is in [1]. It is the situation where, suppose you have a budget B , and you search until your budget is exhausted. Then you are informed that there is an additional budget, B2, and you can continue searching until that is exhausted. A situation is dynamically consistent if, for any B,B2, the optimal strategy is such that you would make the same choices whether you know that you will get B2 or not.<p>So you are correct that discreteness is a problem, because if you are nearing the end of the budget you may optimally prefer to get more dice rolls than take bigger bets. But the optimal solution is then often analytically intractable (or at least it was - I last read about this a while back), and the entropy approach is often reasonable anyway. (For cases where search effort is significant, a good search plan can be found by simulation).<p>[1] <a href="https:&#x2F;&#x2F;bayes.wustl.edu&#x2F;etj&#x2F;articles&#x2F;search.pdf" rel="nofollow">https:&#x2F;&#x2F;bayes.wustl.edu&#x2F;etj&#x2F;articles&#x2F;search.pdf</a>
            • hauntsaninja2 hours ago
              Note that &quot;pick the commit with best expected information gain&quot; in git_bayesect isn&#x27;t optimal even in the no overhead regime. I provide a counterexample in the writeup, which implies ajb&#x27;s heuristic is also not optimal. I don&#x27;t see a tractable way to compute the optimal policy.<p>One idea: if you always spend time testing equal to your constant overhead, I think you&#x27;re guaranteed to be not more than 2x off optimal.<p>(and agreed with ajb on &quot;just use ccache&quot; in practice!)
  • Retr0id11 hours ago
    Super cool!<p>A related situation I was in recently was where I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a &quot;good&quot; vs &quot;bad&quot; commit without repeated trials (in practice I just did repeats).<p>I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
    • hauntsaninja57 minutes ago
      I don&#x27;t yet know a better way to do this than using a threshold!<p>I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I&#x27;ll post an update on <a href="https:&#x2F;&#x2F;github.com&#x2F;hauntsaninja&#x2F;git_bayesect&#x2F;issues&#x2F;25" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;hauntsaninja&#x2F;git_bayesect&#x2F;issues&#x2F;25</a><p>(I really enjoy how many generalisations there are of this problem :-) )
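      One shape the normal-distribution idea could take: score each candidate change point by a two-segment least-squares fit to the raw scores, which is the Gaussian maximum-likelihood change point when the noise level is shared across segments (an illustration, not anything in git_bayesect today):

```python
def two_segment_sse(scores, k):
    # Sum of squared deviations from each segment's own mean. Minimizing
    # this over k is the Gaussian MLE change point for a mean shift with
    # a shared noise level.
    def sse(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs)
    return sse(scores[:k]) + sse(scores[k:])

def best_changepoint(scores):
    return min(range(1, len(scores)),
               key=lambda k: two_segment_sse(scores, k))

scores = [100.1, 99.8, 100.3, 120.4, 119.9, 120.2]
assert best_changepoint(scores) == 3  # perf regression lands at index 3
```

Unlike thresholding, this uses the full magnitude of each noisy measurement, so repeated trials per commit go further.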
    • furyofantares5 hours ago
      I have this same issue a lot.<p>I vibe up a lot of really simple casual games, which should have very minimal demands, and the LLM-agent introduces bad things a lot that don&#x27;t present right away. Either it takes multiple bad things to notice, or it doesn&#x27;t really affect anything on a dev machine but is horrible on wasm+mobile builds, or I just don&#x27;t notice right away.<p>This is all really hard to track down, there&#x27;s noise in the heuristics, and I don&#x27;t know if I&#x27;m looking for one really dumb thing or a bunch of small things that have happened over time.
      • rs5458371 hour ago
        This is a real pain point. One thing that helps: when an LLM agent makes changes across multiple commits, look at what it actually touched structurally. Often the agent adds a feature in commit 5 but subtly breaks something in commit 3 by changing a shared function it didn&#x27;t fully understand.
    • ajb9 hours ago
      At a guess, you can reuse the entropy part, but you&#x27;d need to plug in a new probability distribution.
  • ju571nk3n2 hours ago
    Neat approach. I&#x27;ve been dealing with a similar problem in a different domain -- flaky signal detection in trading data. The Bayesian framing is interesting because most trading signal validators just use fixed thresholds. Curious if the entropy-minimization approach would work for narrowing down which candle pattern rules are unreliable vs genuinely noisy.
  • SugarReflex10 hours ago
    I hope this comment is not out of place, but I am wondering what the application for all this is? How can this help us or what does it teach us or help us prove? I am asking out of genuine curiosity as I barely understand it but I believe it has something to do with probability.<p>edit: thanks for the responses! I was not even familiar with `git bisect` before this, so I&#x27;ve got some new things to learn.
    • augusto-moura10 hours ago
      The writeup [1] linked on the README has examples and a better explanation<p>[1]: <a href="https:&#x2F;&#x2F;hauntsaninja.github.io&#x2F;git_bayesect.html" rel="nofollow">https:&#x2F;&#x2F;hauntsaninja.github.io&#x2F;git_bayesect.html</a>
    • Retr0id9 hours ago
      If you&#x27;re git bisecting a flakey test, normally your only option is to run the test many times until you&#x27;re ~certain it&#x27;s either flakey or not flakey. If your test suite is slow, this can take a long time.<p>One way to think about the tool presented is that it minimizes the number of times you&#x27;d need to run your test suite, to locate the bad commit.
      • ajb9 hours ago
        It&#x27;s worth noting that the analysis (although not this specific algorithm) applies in cases where there is a deterministic approach, but a nondeterministic algorithm is faster.<p>For example, suppose you have some piece of hardware which you can interrogate, but not after it crashes. It crashes at a deterministic point. You can step it forward by any amount of steps, but only examine it&#x27;s state if it did not crash. If it crashed, you have to go back to the start. (I call this situation &quot;Finnegan Search&quot;, after the nursery rhyme which prominently features the line &quot;poor old Finnegan had to begin again&quot;).<p>The deterministic algorithm has you do an examination after every step. The nondeterministic algorithm has you choose some number of steps, accepting the risk that you have to go back to the start. The optimal number of steps (and thus the choice of algorithm) depends on the ratio of the cost of examination to the cost of a step. It can be found analytically as the expected information gain per unit time.<p>(Either way the process is pretty annoying and considerable effort in hardware and software design has gone into providing ways to render it unnecessary, but it still crops up sometimes in embedded systems).
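        A toy cost model for the Finnegan case, showing why expensive examinations favor larger strides (the restart-and-single-step recovery here is one simple policy, not the optimal one):

```python
def cost_to_locate(crash_at, stride, step_cost=1.0, exam_cost=50.0):
    """Cost to find a deterministic crash point when state can only be
    examined before a crash, and any crash forces a restart from step 0.
    Policy: stride forward with an exam after each safe run; on the run
    that crashes, restart, replay to the last known-good step, then
    single-step with exams."""
    cost, pos = 0.0, 0
    while pos + stride < crash_at:        # coarse, safe strides
        pos += stride
        cost += stride * step_cost + exam_cost
    cost += stride * step_cost            # this run crashes
    cost += pos * step_cost               # restart, replay to last good pos
    while pos + 1 < crash_at:             # fine-grained narrowing
        pos += 1
        cost += step_cost + exam_cost
    cost += step_cost                     # final step confirms the crash point
    return cost

# Average over crash points 1..1000: pricey exams make bigger bets worthwhile.
avg = lambda stride: sum(cost_to_locate(c, stride) for c in range(1, 1001)) / 1000
assert avg(30) < avg(1)
```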
    • curuinor10 hours ago
      Bayesian inference is, to be overly simple, a way to write probabilistic if-statements and fit them from data. The &quot;if&quot; statement in this case is &quot;if the bug is there...&quot;, and of course it&#x27;s often the case that in actual software that if statement is probabilistic in nature. This thing does git bisect with a flaky bug with bayesian inference handling the flakiness to get you a decent estimate of where the bug is in the git history. It seems to be usable, or at least as usable as a Show HN thingy is expected to be.
    • teckywoe8 hours ago
      Opening the discussion to include properties of nondeterministic bugs.<p>Often these bugs depend on timing, caused by unpredictable thread scheduling, CPU load, disk and networking timing, etc. Git commits can affect app timing and change the likelihood of the bug occurring, but in many cases these changes aren&#x27;t related to the underlying bug. That&#x27;s distinct from a regular git bisect to find a deterministic bug.<p>One cool bayesect application is to identify the commit that most frequently hits the bug, so it&#x27;s easier to debug. But more broadly, I&#x27;m wondering about the underlying philosophy of bisection for nondeterministic bugs, and when I&#x27;d use it?
      • hrmtst9383751 minutes ago
        Bayesian bisection helps when the failure rate shifts enough to measure. Once a bug rides on scheduler jitter or some external side effect you don&#x27;t control, the posterior gets mushy fast.<p>In practice bayesect gives you a ranked suspect list, and that&#x27;s still better than poking at the repo blind, but calling the top commit &quot;the cause&quot; is where people fool themselves. Half the time the winner only nudged timing and the bug still lives somewhere else.
    • ncruces7 hours ago
      If you want to read more about the bisect idea itself, this link has a bunch of interesting use cases:<p><a href="https:&#x2F;&#x2F;research.swtch.com&#x2F;bisect" rel="nofollow">https:&#x2F;&#x2F;research.swtch.com&#x2F;bisect</a><p>TLDR: you can bisect on more than just &quot;time&quot;.
  • davidkunz11 hours ago
    Useful for tests with LLM interactions.
  • convexly8 hours ago
    Do you expose the posterior probabilities anywhere so you can see how confident it is in the result?
    • hauntsaninja3 hours ago
      Yes! It will show you the posterior probability as a single commit starts to become more likely. In addition, running `git bayesect status` will show you all posterior probabilities
    • ajb6 hours ago
      Not the OP, but FYI you already know that to some extent, because the termination condition is that confidence is above a specified value. This is one of the advantages over just doing git bisect with some finger-in-air test repeat factor. But yeah, it can print that too.
  • IshKebab8 hours ago
    Does this tool assume it takes the same amount of time to test two commits once as it does to test one commit twice? Maybe true for interpreted languages, but if you&#x27;re waiting 15 minutes to compile LLVM you&#x27;re probably going to want to run your 1 second flaky test more than once. Probably pretty trivial to fix this though?<p>Great idea anyway!
    • rs5458371 hour ago
      One cheap optimization for the compile overhead case: skip commits that only touch files unrelated to the failing test. If you know the test&#x27;s dependency chain, any commit that doesn&#x27;t touch that chain gets prior weight zero. Equivalent to git bisect skip but automatic. Cuts the search space before you compile anything.
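      A sketch of that masking step, with the per-commit file sets and the dependency chain as assumed inputs:

```python
def mask_prior(prior, touched_files, dependency_files):
    """Zero the prior for commits that touch nothing in the failing
    test's dependency chain, then renormalize."""
    deps = set(dependency_files)
    masked = [p if files & deps else 0.0
              for p, files in zip(prior, touched_files)]
    z = sum(masked)
    if z == 0:  # nothing overlaps the chain: fall back to the original prior
        return list(prior)
    return [m / z for m in masked]

prior = [0.25, 0.25, 0.25, 0.25]
touched = [{"a.py"}, {"docs.md"}, {"b.py"}, {"README"}]
assert mask_prior(prior, touched, {"a.py", "b.py"}) == [0.5, 0.0, 0.5, 0.0]
```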