
  • augusteo · 19 minutes ago
    The trajectory export feature is smart - evaluation and training data collection in the same tool.

    I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, and popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?

    Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like hard mode for computer-use agents.
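For context on the variance question: one common way benchmarks handle GUI non-determinism is to run each task several times and report spread alongside the success rate, rather than trusting a single trial. A minimal sketch of that pattern - `task.reset()` and `agent.run()` here are hypothetical stand-ins, not cuabench's actual API:

```python
import statistics

def run_trials(task, agent, n_trials: int = 5) -> dict:
    """Run a flaky GUI task repeatedly; report mean success and spread."""
    outcomes = []
    for _ in range(n_trials):
        env = task.reset()       # fresh VM/browser state for every trial
        ok = agent.run(env)      # hypothetical: returns True on success
        outcomes.append(1.0 if ok else 0.0)
    return {
        "success_rate": statistics.mean(outcomes),
        "stdev": statistics.stdev(outcomes) if n_trials > 1 else 0.0,
    }
```

Reporting the standard deviation makes it visible when a task is flaky (popups, animations) rather than genuinely hard for the agent.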
  • visarga · 3 hours ago
    Interesting - a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.

    https://github.com/UiPath/uipath_enterprise_benchmark

    https://arxiv.org/abs/2511.17131
    • frabonacci · 3 hours ago
      Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
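A note on the "code-based evaluation" visarga mentions: in web-agent benchmarks this usually means each task ships a programmatic checker that inspects the final application state, rather than grading screenshots. A rough sketch of the pattern, with illustrative identifiers (not taken from the linked repo), assuming a Playwright-style page handle:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WebTask:
    instruction: str            # natural-language goal given to the agent
    check: Callable[..., bool]  # inspects final state, returns pass/fail

# Example task: the checker verifies actual application state instead of
# pixel-matching. `page` stands in for a browser handle (e.g. a Playwright
# Page); the selector and app are illustrative.
submit_expense = WebTask(
    instruction="File an expense report for $42 in the finance app",
    check=lambda page: page.locator("#expense-status").inner_text() == "Submitted",
)
```

The appeal of this style is determinism on the grading side: however noisy the GUI run is, the pass/fail decision comes from inspecting concrete state.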