I’ve tried several, including this one, and I’ve settled on VoiceInk (local, one-time payment), and with Parakeet V3 it’s stunningly fast (near-instant) and accurate enough to talk to LLMs/code-agents, in the sense that the slight drop in accuracy relative to Whisper Turbo3 is immaterial since they can “read between the lines” anyway.<p>My regular cycle is to talk informally to the CLI agent and ask it to “say back to me what you understood”, and it almost always produces a nice clean and clear version. This simultaneously works as confirmation of its understanding and also as a sort of spec which likely helps keep the agent on track.
I have dystonia, which often stiffens my arms in a way that makes it impossible for me to type on a keyboard. STT apps like SuperWhisper have proven to be very helpful for me in such situations. I am hoping to get a similar experience out of "Handy" (very apt naming from my perspective).<p>I do, however, wonder if there is a way all these STT tools can get to the next level. The generated text should not be just a verbatim copy of what I said; depending on the context, it should elaborate. For example, if my cursor is actively inside an editor/IDE with some code, my coding-related verbal prompts should actually generate the right/desired code in that IDE.<p>Perhaps this is a bit of combining STT with computer-use.
I made something called `ultraplan`. It's a CLI tool that records multi-modal context (audio transcription via local Whisper, screenshots, clipboard content, etc.) into a timeline that AI agents like Claude Code can consume.<p>I have a Claude skill `/record` that runs the CLI, which starts a new recording. I debug, research, etc., then say "finito" (or choose your own stopword). It outputs a markdown file with your transcribed speech interleaved with screenshots and text that you copied. You can say other keywords like "marco" and it will take a screenshot hands-free.<p>When the session ends, Claude reads the timeline (e.g. looks at the screenshots) and gets to work.<p>I can clean it up and push to GitHub if anyone would get use out of it.
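To give a flavour of it, here's a toy sketch of the timeline loop. This is not the actual ultraplan code: `next_phrase` stands in for whatever local Whisper loop you run, and the screenshot call is macOS-only.

```python
# Toy sketch of the "recording -> markdown timeline" idea (not the real tool).
import datetime
import subprocess
from pathlib import Path

def take_screenshot(out_dir: Path) -> Path:
    path = out_dir / f"shot-{datetime.datetime.now():%H%M%S}.png"
    subprocess.run(["screencapture", "-x", str(path)], check=True)  # macOS only
    return path

def record_timeline(next_phrase, out_dir: Path, stopword: str = "finito") -> Path:
    """Consume transcribed phrases until the stopword, building a markdown timeline."""
    out_dir.mkdir(exist_ok=True)
    lines = [f"# Session {datetime.datetime.now():%Y-%m-%d %H:%M}"]
    for phrase in next_phrase():           # phrases from your local Whisper loop
        text = phrase.strip().lower()
        if text == stopword:               # "finito" ends the recording
            break
        if text == "marco":                # "marco" grabs a screenshot hands-free
            lines.append(f"![screenshot]({take_screenshot(out_dir)})")
        else:
            lines.append(phrase.strip())
    timeline = out_dir / "timeline.md"
    timeline.write_text("\n\n".join(lines))
    return timeline                        # point Claude Code at this file
```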
There’s lots of existing work on “coding by voice” <i>long</i> before LLMs were a thing. For example (from 2013):
<a href="http://xahlee.info/emacs/emacs/using_voice_to_code.html" rel="nofollow">http://xahlee.info/emacs/emacs/using_voice_to_code.html</a>
and the associated HN discussion (“Using Voice to Code Faster than Keyboard”): <a href="https://news.ycombinator.com/item?id=6203805">https://news.ycombinator.com/item?id=6203805</a><p>There’s also more recent-ish research, like <a href="https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130" rel="nofollow">https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130</a>
I totally agree with you, and largely what you're describing is one of the reasons I made Handy open source. I really want to see something like this and see someone experiment with making it happen. I did hear of some people playing with small local models (Moondream, Qwen) to get more context from the computer itself.<p>I initially had a ton of keyboard shortcuts in Handy for myself when I had a broken finger and was in a cast. It let me play with the simplest form of this contextual thing, as shortcuts could effectively be mapped to certain apps with very clear use cases.
What you're describing is possible by feeding the output of speech-to-text tools into an LLM. You can prompt the LLM to make sense of what you're trying to achieve and turn it into a set of actions. With a CLI it's trivial: you can have your verbal command translated into working shell commands. With a GUI it's slightly more complicated because the LLM agent needs to know what you see on the screen, etc.<p>That CLI bit I mentioned earlier is already possible. For instance, on macOS there's an app called MacWhisper that can send dictation output to an OpenAI-compatible endpoint.
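A minimal sketch of that CLI idea, assuming a local OpenAI-compatible server; the base URL and model name are placeholders, and any generated command should obviously be reviewed before running it:

```python
# Sketch: dictated request -> LLM -> single shell command (review before running!).
from openai import OpenAI

# Placeholder endpoint/model; point this at whatever OpenAI-compatible server you run.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def voice_to_shell(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="qwen2.5-coder",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Translate the user's spoken request into exactly one POSIX "
                        "shell command. Reply with the command only, no explanation."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content.strip()

# e.g. voice_to_shell("show me the ten largest files in this directory")
```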
Love it. I had been searching for an STT app for weeks. Every single app was either paid as a one-off or had a monthly subscription. It felt a bit ridiculous having to pay when it's all powered by such small models on the back end, so I decided to build my own. But then I found "Handy" and it's been a really amazing partner for me. Super fast, super simple, doesn't get in my way, and it's constantly updated. I just love it. Thanks a lot for making it!<p>P.S. The post-processing that you are talking about, wouldn't it be awesome.
Explain to me why a speech-to-text app has 50% of its code in TypeScript...?
This looks great! What’s missing for me to switch from something like Wispr Flow is the ability to provide a dictionary for commonly mistaken words (name of your company, people, code libraries).
It has something called "Custom Words", which might be what you are describing. I haven't properly tested this feature yet.
There’s a PR for this which will be pulled in soon enough. I can kick off a build of the PR if you want to download a pre-release version.
Okay, so it's more like direct text replacements:<p><a href="https://github.com/cjpais/Handy/actions/runs/21025848728" rel="nofollow">https://github.com/cjpais/Handy/actions/runs/21025848728</a><p>There is also LLM post-processing, which can do this, and the built-in dictionary feature.
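For anyone wondering, the text-replacement approach is essentially a substitution map applied after transcription. A toy sketch (not Handy's actual implementation; the word list here is made up):

```python
# Toy post-processing pass: fix commonly misheard terms after transcription.
import re

REPLACEMENTS = {              # made-up examples; fill in your own problem words
    "cloud code": "Claude Code",
    "pie torch": "PyTorch",
    "parakeet tree": "Parakeet V3",
}

def apply_replacements(text: str) -> str:
    for wrong, right in REPLACEMENTS.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text

print(apply_replacements("open the cloud code docs for pie torch"))
```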
I dig that some models have the ability to say how sure they are of each word. Manually entering a bunch of special words is OK, but I want to be able to review the output and see which words the model was less sure of, so I can go find out what I might need to add.
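For what it's worth, this is already exposed by some of the underlying libraries; faster-whisper, for example, reports a per-word probability, so something like the sketch below can flag words worth reviewing (the model size and the 0.6 threshold are arbitrary choices):

```python
# Sketch: surface low-confidence words from a transcription for manual review.
from faster_whisper import WhisperModel

model = WhisperModel("base.en")
segments, _info = model.transcribe("dictation.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        if word.probability < 0.6:   # flag words the model was unsure about
            print(f"{word.word.strip()!r} at {word.start:.1f}s "
                  f"(p={word.probability:.2f}) - maybe add it to your custom words")
```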
Nice. I spent most of Christmas vibe coding with Google Antigravity with one hand while holding a sleeping baby in the other. macOS's built-in dictation is OK, but it struggles with technical language.
Has anyone compared this with <a href="https://github.com/HeroTools/open-whispr" rel="nofollow">https://github.com/HeroTools/open-whispr</a> already? From the description they seem very similar.<p>Handy's first release was June 2025, OpenWhispr's a month later. Handy has ~11k GitHub stars, OpenWhispr has ~730.
I did try it, but the ease of installing Handy as just a macOS app is so much simpler than needing to constantly run npm commands. I think at the time I was checking it, which was a couple of months ago, they did not have the Parakeet model (a non-Whisper model), so I decided against it. If I remember correctly, the UI was also not the smoothest.<p>Handy's UI is so clean and minimalistic that you always know what to do or where to go. Yes, it lacks some advanced features, but honestly, I've been using it for two months now and I've never looked back or searched for any other STT app.
The OP asked if someone had compared both, which usually means actually trying both, not just installing one and skimming through the other's README file. So, in summary, you didn't try both and didn't answer the OP.
It's incredibly fast on my M1 MacBook Air and more accurate than the native speech-to-text.<p>The UI is well thought out, with just the right amount of settings for my usage.<p>Incredible!<p>Btw, do you know what « discharging the model » does? It's set to never by default; I tried to check if it has an impact on RAM or CPU, but it doesn't seem to do anything.
The Parakeet V3 model is really great!
A question because I'm not using speech-to-text, but find it intriguing (especially since it's now possible to do locally and for free).<p>How have your computing habits changed as a result of having this? When do you typically use this instead of typing on the keyboard?
I use it all the time with coding agents, especially if I'm running multiple terminals. It's way faster to talk than type. The only problem is that it looks awkward if there are others around.
Part of my job is to give feedback to people using Word Comments. Using STT, it's been a breeze. The time saving really is great. Thing is, I only do this when working at home with no one around. So really only when WFH.
I just set this up today. I had Whispering app set up on my Windows computer, but it really wasn't working well on my Ubuntu computer that I just set up. I found Handy randomly. It was the last app I needed to go Linux full-time. Thank you!
This looks and works great! A settings option to keep no recording history at all would be terrific.
As a Mac user, am I missing something? macOS has Dictation built-in, when you short press F5 it should start transcribing your spoken words into text in real time. It even does non-English languages.
Besides being trash, as others have said, there's a trade-off with real-time word-by-word transcription: there's no opportunity for an AI to holistically correct/clean up the transcription.
It's trash if:<p>- you're not a native speaker or have an accent<p>- you're using the AirPods mic<p>- your surroundings are noisy<p>- you use novel words like 'claude code'<p>- you mumble a bit
Does this (or open-whispr) work well with languages other than English?
On an M4 MacBook Air, there was enough lag to make it unusable for me. I'd hit the shortcut and start speaking, but there was always a 1-2 second delay before it would actually start transcribing, even when the icon was displayed.
Curious if you were using AirPods or other Bluetooth headphones for this?<p>If so, there should be a "keep microphone on" or similar setting in the config that may help with this. Alternatively, I set my microphone to my MacBook mic so that my headphones aren't involved at all, and there is much less latency on activation.
Yes, I've got the same situation. I've kind of learned to wait for one or two seconds before talking. I am using it with AirPods, so maybe it's indeed the Bluetooth thing.
What microphone are you using?
Does anyone have a similar mobile application that works locally and is not too expensive? Mostly looking to transcribe voice messages sent over Signal which does not offer this OOTB
I have been using this one from Futo for quite some time and love it: <a href="https://keyboard.futo.org/" rel="nofollow">https://keyboard.futo.org/</a><p>They also have a voice input only version if you still would like to keep your typing keyboard: <a href="https://voiceinput.futo.org/" rel="nofollow">https://voiceinput.futo.org/</a>
There is one single app I've been able to find that offers Parakeet-v3 for free locally and it's called Spokenly. They have paid cloud models available as well, but the local Parakeet-v3 implementation is totally free and is the best STT has to offer these days regardless. Super fast and accurate. I consider single-user STT basically a solved problem at this point.
Use it daily. Looks and works great.
This is so handy, thank you very much. Good work!!
Is it deployed locally or does it send data to your servers?
There's a slightly awkward naming overlap with an existing product.
Which one? I did a quick search but that didn't turn up anything, so perhaps it's a partial word overlap or something.<p>I did find the project's "user-facing" home page [1], which was nice. I found it rather hard to find a link from that to the code on GitHub, which was surprising.<p>[1]: <a href="https://handy.computer/" rel="nofollow">https://handy.computer/</a>
This is a slightly German-centric comment.
Big Handy fan!
Looks interesting. Why does it need a GUI at all?
As an alternative to Wisprflow, Superwhisper and so on. It works really well compared to the commercial competitors but with a local model.
It doesn't! It just makes it more accessible to more people, I feel. There's a CLI version for Mac, handy-cli, which I wrote first.
Ah, that was a typo: you meant "GPU" (Graphics Processing Unit), not "GUI" (which of course is Graphical User Interface), since that's what's listed in the system requirements. An existing comment already explains it implicitly, thanks!
Do I hear a CLI request? There are tons of CLI speech-to-text tools, by the way; really glad to see this one. Excellent competitors (Superwhisper, MacWhisper, etc.) are closed/paid.
Because local AI models run well on a GPU, better than on a CPU
So more people can use it?
It would be nice if the output could be piped directly into Claude Code.
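In the meantime, something like this works as a rough stand-in, assuming the `claude` CLI's non-interactive `-p`/`--print` mode; the STT command here is hypothetical, so substitute whatever writes your transcript to stdout:

```python
# Rough sketch: grab one transcription, hand it to Claude Code non-interactively.
import subprocess

transcript = subprocess.run(
    ["my-stt-cli", "--once"],        # hypothetical command; use your own STT CLI
    capture_output=True, text=True, check=True,
).stdout.strip()

subprocess.run(["claude", "-p", transcript], check=True)  # Claude Code print mode
```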
Crashes on Tahoe 26.3 Beta 1 :(