Hey, thanks for the question.

From my experience with LLMs, if you ask them directly "is this a bug or a feature?" they might start hallucinating and assuming stuff that isn't there.

I found in a few research/blog posts that if you instead ask the LLM to categorize (basically label) the issue and provide a score for which category it belongs to, it performs very well.

So that's exactly what this tool does. When it sees a failing test, it formulates the prompt in the following way:

## SOURCE CODE UNDER TEST:
## FAILED TEST CODE:
## PYTEST FAILURE FOR THIS TEST:
## PARSED FAILURE INFO:
## YOUR TASK:
Perform a deep "Step-by-Step" analysis to determine if this failure is:
1. *hallucination*: The test expects behavior, parameters, or side effects that do NOT exist in the source code.
2. *source_bug*: The test is logically correct based on the requirements/signature, but the source code has a bug (e.g., missing await, wrong logic, typo).
3. *mock_issue*: The test is correct but the technical implementation of mocks (especially AsyncMock) is problematic.
4. *test_design_issue*: The test is too brittle, over-mocked, or has poor assertions.

It also assigns a "confidence" score to its answer. Based on the category and confidence, the tool either fully regenerates the test, comments on the bug in the test, fixes the mocks, or redesigns the test entirely (if it's too brittle).

While this is not 100% bulletproof, I found it to be a quite effective approach: basically using the LLM for categorization.

Hope that answers your question!
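To give a rough idea of the shape of this, here's a simplified sketch of the prompt assembly and the category-to-action mapping. This is not the actual implementation: the function names, the JSON response format, and the 0.8 confidence threshold are illustrative only.

```python
# Simplified sketch -- not the real implementation.
from dataclasses import dataclass

CATEGORIES = ("hallucination", "source_bug", "mock_issue", "test_design_issue")

@dataclass
class Verdict:
    category: str      # one of CATEGORIES, as labeled by the LLM
    confidence: float  # 0.0-1.0, self-reported by the LLM

def build_review_prompt(source_code: str, test_code: str,
                        pytest_output: str, parsed_failure: str) -> str:
    """Assemble the sectioned categorization prompt shown above."""
    return "\n".join([
        "## SOURCE CODE UNDER TEST:", source_code,
        "## FAILED TEST CODE:", test_code,
        "## PYTEST FAILURE FOR THIS TEST:", pytest_output,
        "## PARSED FAILURE INFO:", parsed_failure,
        "## YOUR TASK:",
        'Perform a deep "Step-by-Step" analysis to determine if this failure is:',
        "1. hallucination  2. source_bug  3. mock_issue  4. test_design_issue",
        'Answer as JSON: {"category": "<one of the four>", "confidence": <0.0-1.0>}',
    ])

def decide_action(verdict: Verdict, threshold: float = 0.8) -> str:
    """Map the labeled category to the follow-up action."""
    if verdict.confidence < threshold:
        return "flag_for_closer_look"  # illustrative: low confidence gets extra scrutiny
    return {
        "hallucination": "regenerate_test",    # test expects behavior that doesn't exist
        "source_bug": "comment_on_bug",        # test is fine; flag the bug
        "mock_issue": "fix_mocks",             # repair AsyncMock / patching setup
        "test_design_issue": "redesign_test",  # too brittle / over-mocked
    }[verdict.category]
```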
To clarify, each failing test triggers a "review" agent to determine why the test fails. This could probably be improved with better heuristics and more in-depth static analysis of the source code, but that's how it works in the current version.
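Conceptually, the per-test loop looks something like this (simplified sketch: `collect_failures` just shells out to pytest with a JUnit XML report, and `review_failure` is a stand-in for the categorization step above):

```python
# Rough sketch of the per-failure loop, not the tool's actual code.
import subprocess
import xml.etree.ElementTree as ET

def collect_failures(report_path: str = "report.xml"):
    """Run pytest with a JUnit XML report and yield (test_id, failure_text) pairs."""
    subprocess.run(["pytest", f"--junitxml={report_path}", "-q"], check=False)
    tree = ET.parse(report_path)
    for case in tree.iter("testcase"):
        failure = case.find("failure")
        if failure is not None:
            test_id = f"{case.get('classname')}::{case.get('name')}"
            yield test_id, failure.text or ""

def review_failure(test_id: str, failure_text: str) -> None:
    """Placeholder for the review agent: build the categorization prompt,
    get a category + confidence back, then regenerate / comment / fix / redesign."""
    print(f"reviewing {test_id} ({len(failure_text)} chars of failure output)")

if __name__ == "__main__":
    for test_id, failure_text in collect_failures():
        review_failure(test_id, failure_text)
```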
I wonder if always having a design doc of some substance, discussing the intended behavior of the whole app, would help reduce instances of hallucination. The human developer should create it and let the AI access it.
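If a tool went that way, one simple approach could be to load the doc and prepend it as an extra section of the prompt, e.g. (the `DESIGN.md` path and section header here are made up):

```python
# Speculative sketch: include a human-written design doc as prompt context.
from pathlib import Path

def with_design_doc(prompt: str, doc_path: str = "DESIGN.md") -> str:
    doc = Path(doc_path)
    if not doc.exists():
        return prompt  # fall back gracefully when no design doc is provided
    return "## INTENDED BEHAVIOR (from design doc):\n" + doc.read_text() + "\n" + prompt
```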