The trajectory export feature is smart. Evaluation and training data collection in the same tool.<p>I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?<p>Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.