14 comments

  • obsidianbases1 14 minutes ago
    Nice work. I scanned through the code and found this file to be an interesting read: https://github.com/understudy-ai/understudy/blob/main/packages/core/src/system-prompt-sections.ts
  • rybosworld 2 hours ago
    I have a hard time believing this is robust.
  • walthamstow 1 hour ago
    It's a really cool idea. Many desktop tasks are teachable like this.

    The look-click-look-click loop it used for sending the Telegram for Musk was pretty slow. How intelligent (and therefore slow) does a model have to be to handle this? What model was used for the demo video?
  • 8note 52 minutes ago
    Sounds a bit sketch?

    Learning to do a thing means handling the edge cases, and you can't exactly do that in one pass?

    When I've learned manual processes it's been at least 9 attempts: 3 watching, 3 doing with an expert watching, and 3 with the expert checking the result.
  • sethcronin 2 hours ago
    Cool idea -- the Claude Chrome extension has something like this implemented, but obviously it's restricted to the Chrome browser.
  • abraxas 4 hours ago
    One more tool targeting OSX only. That platform is overserved with desktop agents already while others are underserved, especially Linux.
    • bayes-song 4 hours ago
      Fair point that Linux is underserved.

      My own view is that the bigger long-term opportunity is actually Windows, simply because more desktop software and more professional workflows still live there. macOS-first here is mostly an implementation / iteration choice, not the thesis.
    • renewiltord 4 hours ago
      That's mostly because macOS users make tools that solve their problems, while Linux users go online to complain that no one has solved their problem -- and that if anyone did, they'd want it to be free.
  • mustafahafeez 1 hour ago
    Nice idea
  • jedreckoning 4 hours ago
    Cool idea. Good call doing a demo as well.
  • aiwithapex 5 hours ago
    [dead]
  • webpolis 4 hours ago
    [dead]
  • wuweiaxin 5 hours ago
    [flagged]
    • ghjv 4 hours ago
      Out of curiosity - were this and other comments from this account written by hand, or generated and posted by an agent on behalf of a human user?
      • hrimfaxi 1 minute ago
        This kind of comment from greens (and even old accounts) has been popping up nonstop lately.
      • rogerrogerr 3 hours ago
        Feels like an agent that has been told to use `--` instead of emdash.
    • bayes-song 4 hours ago
      That's exactly the hard part, and I agree it matters more than the happy path.

      A few concrete things we do today:

      1. It's fully agentic rather than a fixed replay script. The model is prompted to treat GUI as one route among several, to prefer simpler / more reliable routes when available, and to switch routes or replan after repeated failures instead of brute-forcing the same path. In practice, we've also seen cases where, after GUI interaction becomes unreliable, the agent pivots to macOS-native scripting / AppleScript-style operations. I wouldn't overclaim that path though: it works much better on native macOS surfaces than on arbitrary third-party apps.

      2. GUI grounding has an explicit validation-and-retry path. Each action is grounded from a fresh screenshot, not stored coordinates. In the higher-risk path, the runtime does prediction, optional refinement, a simulated action overlay, and then validation; if validation rejects the candidate, that rejection feeds the next retry round. And if the target still can't be grounded confidently, the runtime returns a structured `not_found` rather than pretending success.

      3. The taught artifact has some built-in generalization. What gets published is not a coordinate recording but a three-layer abstraction: intent-level procedure, route options, and GUI replay hints as a last resort. The execution policy is adaptive by default, so the demonstration is evidence for the task, not the only valid tool sequence.

      In practice, when things go wrong today, the system often gets much slower: it re-grounds, retries, and sometimes replans quite aggressively, and we definitely can't guarantee that it will always recover to the correct end state. That's also exactly the motivation for Layer 3 in the design: when the system does find a route / grounding pattern / recovery path that works, we want to remember that and reuse it later instead of rediscovering it from scratch every time.
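The validation-and-retry grounding loop described in point 2 can be sketched roughly as below. This is an illustrative reconstruction, not Understudy's actual code; every identifier (`groundWithRetry`, `Predictor`, `Validator`, `GroundingResult`) is hypothetical.

```typescript
type Point = { x: number; y: number };

// Structured result: either a validated click target, or an explicit miss.
type GroundingResult =
  | { status: "ok"; point: Point }
  | { status: "not_found"; attempts: number };

// Stand-ins for the model-backed steps: predict a candidate location from a
// fresh screenshot, then validate the candidate before acting on it.
type Predictor = (screenshot: string, target: string) => Point | null;
type Validator = (screenshot: string, target: string, candidate: Point) => boolean;

function groundWithRetry(
  takeScreenshot: () => string,
  predict: Predictor,
  validate: Validator,
  target: string,
  maxAttempts = 3,
): GroundingResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Always re-ground from a fresh frame, never from stored coordinates.
    const screenshot = takeScreenshot();
    const candidate = predict(screenshot, target);
    if (candidate !== null && validate(screenshot, target, candidate)) {
      return { status: "ok", point: candidate };
    }
    // A rejected candidate feeds the next retry round (e.g. as prompt context).
  }
  // Never pretend success: surface a structured failure to the planner.
  return { status: "not_found", attempts: maxAttempts };
}
```

The two properties from the comment are that each attempt re-grounds from a fresh screenshot, and that exhausting retries yields a structured `not_found` the planner can react to instead of a fake success.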
      • dec0dedab0de 4 hours ago
        What if you had it ask for another demonstration when things are different? Or when it's different and taking more than X amount of time to figure out -- like an actual understudy would.
        • bayes-song 4 hours ago
          That sounds like a good idea. During the use of a skill, if the agent finds something unclear, it could proactively ask the user for clarification and update the skill accordingly. This seems like a very worthwhile direction to explore.

          In the current system, I have implemented a periodic sweep over all sessions to identify completed tasks, cluster those tasks, and summarize the different solution paths within each cluster to extract a common path and proactively add it as a new skill. However, so far this process only adds new skills and does not update existing ones. Updating skills based on this feedback loop seems like something worth pursuing.
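The periodic sweep described here (cluster completed sessions by task, extract the shared solution path, promote it to a skill) can be sketched as follows. All types and names (`Session`, `Skill`, `mineSkills`) are hypothetical stand-ins, and the intersection step is a crude placeholder for the LLM-based summarization the comment describes.

```typescript
interface Session { task: string; steps: string[]; completed: boolean }
interface Skill { name: string; commonPath: string[] }

function mineSkills(sessions: Session[]): Skill[] {
  // 1. Cluster completed sessions by task.
  const clusters = new Map<string, string[][]>();
  for (const s of sessions) {
    if (!s.completed) continue;
    const runs = clusters.get(s.task) ?? [];
    runs.push(s.steps);
    clusters.set(s.task, runs);
  }

  // 2. For each cluster with repeated evidence, keep the steps of the first
  //    run that appear in every other run as the candidate "common path".
  const skills: Skill[] = [];
  for (const [task, runs] of clusters) {
    if (runs.length < 2) continue; // require repeated evidence before promoting
    const common = runs.reduce((acc, run) => acc.filter((step) => run.includes(step)));
    if (common.length > 0) skills.push({ name: task, commonPath: common });
  }
  return skills;
}
```

As the comment notes, a sweep like this only ever adds new skills; feeding the mined path back into an existing skill would require a merge step this sketch does not attempt.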
        • ptak_dev 55 minutes ago
          [dead]
      • throwaway23293 2 hours ago
        You're replying to a bot... probably someone's openclaw
  • mahendra0203 2 hours ago
    [flagged]
    • throwaway23293 2 hours ago
      Why can't people write comments by hand these days?
      • InsideOutSanta 2 hours ago
        Are they even people? I've stopped going to Reddit because many of the subreddits I used to enjoy have devolved into bots talking to bots, interspersed with a bunch of confused humans. That's probably the future of every public forum.
        • throwaway23293 2 hours ago
          It is just sad to see HN becoming this abomination...

          Site guidelines: https://news.ycombinator.com/newsguidelines.html#comments

          > Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.

          ---

          Is it really an "illusion" anymore???
  • sukhdeepprashut 5 hours ago
    2026 and we still pretend not to understand how LLMs work, huh.