26 comments

  • anvevoice 2 hours ago
    One thing I don't see discussed enough: the real-time constraint in audio actually works in small labs' favor.

    For LLMs, you can throw more compute at the problem and batch requests. For voice, you have a hard latency ceiling — humans notice delays above ~200ms in conversation. That means you can't just scale up a massive model and call it a day. You need lean, efficient architectures that can stream output token-by-token with minimal buffering.

    This naturally caps model size in a way that keeps the playing field more level. A 1B parameter speech model running on a single GPU can be competitive with anything a big lab deploys, because the latency constraint prevents them from leveraging their main advantage (scale).

    The other piece is domain specificity. A speech model for medical dictation vs. a voice agent for restaurant reservations vs. a podcast transcriber — these benefit enormously from targeted fine-tuning on relatively small, domain-specific datasets. Big labs optimize for generality. Small teams that deeply understand one vertical can build something meaningfully better for that use case with a fraction of the compute budget.

    The vladimirralev comment above about GPU utilization being 20-30% for realtime audio is spot on. The serving economics are fundamentally different from text, and that's exactly the kind of problem where clever engineering from a small team beats brute-force scaling.
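    (Back-of-the-envelope version of that latency budget; all numbers below are illustrative, not measured:)

        # rough latency budget for a turn-based voice agent (numbers made up)
        budget_ms      = 200   # roughly where a pause starts to feel unnatural
        network_ms     = 40    # round trip to the server
        endpointing_ms = 60    # deciding the user actually stopped talking
        vocoder_ms     = 30    # synthesizing the first audio chunk

        model_ms = budget_ms - network_ms - endpointing_ms - vocoder_ms
        print(f"time-to-first-token budget left for the model: {model_ms} ms")  # 70 ms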
  • d4rkp4ttern 1 hour ago
    It's amazing how good open-weight STT and TTS have gotten, so there's no need to pay for Wispr Flow, Superwhisper, ElevenLabs etc.

    Sharing my setup in case it may be useful for others; it's especially useful when working with CLI agents like Claude Code or Codex CLI:

    STT: Hex [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate back to me what it understood, and it gives back a nicely structured version -- this helps confirm understanding as well as likely helps the CLI agent stay on track. It is a macOS native app and leverages the CoreML/Neural Engine to get extremely fast transcription (I used to recommend a similar app, Handy, but it has frequent stuttering issues, and Hex is actually even faster, which I didn't think was possible!)

    TTS: Kyutai's Pocket-TTS [2], just 100M params, and amazing speech quality (English only). I made a voice plugin [3] based on this for Claude Code, so it can speak out short updates whenever CC stops. It uses a combination of hooks that nudge the main agent to append a speakable summary, falling back to a headless agent in case the main agent forgets. Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style and mirror your vibe and "colorful language" etc.

    The voice plugin gives commands to control it:

      /voice:speak stop
      /voice:speak azelma (change the voice)
      /voice:speak prompt <your arbitrary prompt to control the style>

    [1] Hex: https://github.com/kitlangton/Hex

    [2] Pocket-TTS: https://github.com/kyutai-labs/pocket-tts

    [3] Voice plugin for Claude Code: https://pchalasani.github.io/claude-code-tools/plugins-detail/voice/
    • freedomben 22 minutes ago
      Anyone know of something like Hex that runs on Linux?
      • d4rkp4ttern 18 minutes ago
        Handy is cross-platform, including Linux.
  • Taek 7 hours ago
    My understanding is that this is purely a strategic choice by the bigger labs. When OpenAI released Whisper, it was by far best-in-class, and they haven't released any major upgrades since then. It's been 3.5 years... Whisper is older than ChatGPT.

    Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording of myself in a car, with me talking, and another passenger talking to the driver, and the radio playing, me in English, the radio in Portuguese, and the driver+passenger in Spanish, Gemini can parse all 4 audio streams as well as other background noises and give a translation for each one, including figuring out which voice belongs to which person, and what everyone's names are (if it's possible to figure that out from the conversation).

    I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.
  • AustinDev 29 minutes ago
    Audio models are also tiny, which is probably why small labs are doing well in the space. I run a LoRA'd Whisper v3 Large for a client. We can fit 4 versions of the model in memory at once on a ~$1/hr A10 and have half the VRAM leftover.

    Each of the LoRA tunes we did took maybe 2-3 hours on the same A10 instance.
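    (For anyone curious what that looks like in practice, a minimal sketch with transformers + peft; the adapter name below is hypothetical, and the fp16 base is only ~3 GB, which is why several copies plus adapters fit comfortably on a 24 GB A10:)

        # load the shared Whisper base once, then attach a tiny domain-specific LoRA
        import torch
        from transformers import WhisperForConditionalGeneration, WhisperProcessor
        from peft import PeftModel

        base = WhisperForConditionalGeneration.from_pretrained(
            "openai/whisper-large-v3", torch_dtype=torch.float16
        ).to("cuda")
        processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

        # the LoRA weights add only a few tens of MB on top of the base model
        model = PeftModel.from_pretrained(base, "your-org/whisper-lora-dictation")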
    • freedomben 24 minutes ago
      Is Whisper still getting nontrivial development? I was under the impression that it had stagnated, but it seems hard to find more than just rumors
  • qoez 4 hours ago
    OpenAI and Google are too scared of music industry lawyers to tackle this. Internally they without a doubt have models that would crush these startups overnight if they chose to release them.
    • WJW 4 hours ago
      Is your claim that music industry lawyers are that much scarier than movie industry lawyers? Because the big labs don't seem to have any problem releasing models that create (possibly infringing) video.
      • spacebanana7 3 hours ago
        The movie industry is doing well from AI.

        Thus far AI has only been used to create fan fiction clips that generate free marketing for legacy IP on TikTok. And the rights holders know that if AI gets good enough to make feature length movies then they'll be able to aggressively use various legal mechanisms to take the videos off major sites and pursue the creators. Long term it could potentially lower internal production costs by getting rid of actors & writers.

        Music is very different. The production cost is already zero, and people generating their own Taylor Swift songs is a real competitive threat to Spotify etc.
      • aleph_minus_one 4 hours ago
        > Is your claim that music industry lawyers are that much scarier than movie industry lawyers?

        Not qoez:

        You have to balance market opportunities with the risk of reputational damage and litigation risk.

        Video will probably make a lot more money than audio, so you are willing to take a bigger risk. Additionally, at least for Google there exists a strong synergy between their video generation models and YouTube, which makes it even more sensible for Google to make video models available to the public despite these risks.
      • throawayonthe 3 hours ago
        well i guess the music industry is a lot more monopolized than video, plus there is a lot of video out there that isn't "movies," while there's not a lot of music that isn't... "music"
    • amelius 2 hours ago
      What about Disney's lawyers? GenAI for images exists ...
    • KurSix 4 hours ago
      I'm not sure it's just fear of lawyers, although that's definitely part of it. Big companies have way more to lose reputationally and legally, so the bar for releasing something is much higher.
  • lukax 4 hours ago
    Never any mention of Soniox, even though they are on the Pareto frontier [1].

    [1] https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
  • nowittyusername 13 hours ago
    Good article, and I agree with everything in there. For my own voice agent I decided to make him PTT by default, as the problems with the model accurately guessing the end of utterance are just too great. I think it can be solved in the future, but I haven't seen a really good example of it being done with modern-day tech, including by these labs.

    Fundamentally it all comes down to the fact that different humans have different ways of speaking, and the human listening to them updates their own internal model of the speech pattern, adjusting it after a couple of interactions and arriving at the proper way of speaking with that person. Something very similar will need to be done, and at very low latency, for it to succeed in the audio ML world. But I don't think we have anything like that yet. It seems the best you can currently do is tune the model on a generic speech pattern that you expect to fit a large percentage of the human population, and anyone who falls outside of that will feel the pain of getting interrupted every time.
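    (For context, the baseline most of us start from is just a VAD plus a fixed silence timeout, roughly like the sketch below using webrtcvad; the 700 ms cutoff is an arbitrary assumption, and that one-size-fits-all constant is exactly what breaks down across different speakers:)

        # naive end-of-utterance detection: "the turn ended once we've seen N ms of silence"
        import webrtcvad

        SAMPLE_RATE = 16000
        FRAME_MS = 30        # webrtcvad accepts 10/20/30 ms frames of 16-bit mono PCM
        SILENCE_MS = 700     # fixed cutoff; too short interrupts slow or pausing talkers

        vad = webrtcvad.Vad(2)   # aggressiveness 0-3

        def is_end_of_utterance(frames):
            """frames: list of raw PCM chunks, FRAME_MS each, newest last."""
            silent = 0
            for frame in reversed(frames):
                if vad.is_speech(frame, SAMPLE_RATE):
                    break
                silent += FRAME_MS
            return silent >= SILENCE_MS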
    • spuz 2 hours ago
      Check out Sparrow-0. The demo shows an impressive ability to predict when the speaker has finished talking:

      https://www.tavus.io/post/sparrow-0-advancing-conversational-responsiveness-in-video-agents-with-transformer-based-turn-taking
    • KurSix 4 hours ago
      It feels like this is one of those areas where the last 10% of polish will take 90% of the effort
  • dkarp 15 hours ago
    There's too much noise at large organizations
    • United857 9 hours ago
      I see what you did there.
    • echelon 14 hours ago
      They're focused on soaking up big money first.

      They'll optimize down the stack once they've sucked all the oxygen out of the room.

      Little players won't be able to grow through the ceiling the giants create.
      • etherus 12 hours ago
        Why would they do that? Once they have their win condition, there's no reason to innovate, only to reduce the costs of existing solutions. I expect that unless voice becomes a parameter which drives competition and first-choice adoption, it will never become a focus of the frontier orgs. Which is curious to me, as it's almost the opposite of how I'm reading your comment.
  • beaker52 5 hours ago
    There's simply not enough of a market for these bigger orgs to be truly interested/invested in audio, video and even image to an extent.

    They'll wait for progress to be made and then buy the capability/expertise/talent when the time is right.
  • hadlock 9 hours ago
    Can someone recommend a service that will generate a loopable engine drone for a "WWII Plane Japan Kawasaki Ki-61"? It doesn't have to be perfect, just convincing in a Hollywood blockbuster context, and not just a warmed-over clone of a Merlin engine sound. Turns out Suno will make whatever background music I need, but I want a "unique sound effect on demand" service. I'm not convinced voice AI stuff is sustainable.
    • almostdigital 9 hours ago
      https://elevenlabs.io/sound-effects

      With the prompt "WWII Plane Japan Kawasaki Ki-61 flying by, propeller airplane", looping turned on, and a 30 sec duration set manually instead of auto (the duration predictor fails pretty badly on this prompt; you need to be logged in to set duration manually), it works pretty well. No idea if it's close to that specific airplane, but it sounds like a WW2 plane to me.
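      (If you'd rather script it, ElevenLabs also exposes the sound-effects model over an API. Rough sketch below; treat the endpoint and field names as assumptions from memory and check their docs before relying on it:)

          # generate a ~30 s sound effect from a text prompt and save it as mp3
          import requests

          resp = requests.post(
              "https://api.elevenlabs.io/v1/sound-generation",
              headers={"xi-api-key": "YOUR_API_KEY"},
              json={
                  "text": "WWII Plane Japan Kawasaki Ki-61 flying by, propeller airplane",
                  "duration_seconds": 30,
              },
          )
          resp.raise_for_status()
          with open("ki61_flyby.mp3", "wb") as f:
              f.write(resp.content)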
      • Imustaskforhelp 2 hours ago
        are there any open source alternatives to this as well if I may ask?
    • nextaccountic 7 hours ago
      you mean you want some ai product that generates sound effects from a textual prompt? elevenlabs has a model specifically for that

      https://elevenlabs.io/sound-effects
  • shenberg 4 hours ago
    Moshi was an amazing tech demo; building the entire stack from scratch in 6 months with a small team was a real show of skill: 7B text LLM data + training, emotive TTS for synth data generation (again model + data collection), synth data pipeline, novel speech codec, Rust inference stack for low latency, audio LLM architecture incl. the text "thoughts" stream, which was novel.

    But this piece is a fluff piece: "underfunded" means a total of around $400 million ($330 million in the initial round, $70 million for Gradium). Compare to ElevenLabs, who used a $2 million pre-seed to create their initial product.

    A bunch of other stuff there is disingenuous, like comparing their 7B model to Llama-3 405B (hint: the 7B model is a _lot_ dumber). There's also an outright lie: a team of 4 made Moshi, which is corrected _in the same piece_ to 8 if you read far enough.
    • Miraltar 3 hours ago
      Stopped reading there: "This model (Moshi) could [...] recite an original poem in a French accent (research shows poems sound better this way)."
  • giancarlostoro 15 hours ago
    OpenAI being the death star and audio AI being the rebels is such a weird comparison, like what? Wouldn't the real rebels be the ones running their own models locally?
    • tgv 3 hours ago
      Audio AI companies are just another death star, intent on reducing human creativity to "make a song like Let it be, but in the style of Eminem, and change the lyrics to match the birthday of my mother in law". The only rebels are musicians resisting this hedge-fund driven monstrosity.
    • tl2do 14 hours ago
      True, but there's a fun irony: the Rebels' X-Wings are powered by GPUs from a company that's... checks relationships ...also supplying the Empire.

      NVIDIA's basically the galaxy's most successful arms dealer, selling to both sides while convincing everyone they're just "enabling innovation." The real rebels would be training audio models on potato-patched RP2040s. Brave souls, if they exist.
      • tadfisher 7 hours ago
        The company behind the T-65B X-Wing, Incom Corporation, did supply the Empire, as they did the Republic Navy before. By 0 BBY, Incom was nationalized by the Imperials. The X-Wing became the mainstay Alliance fighter because the plans were stolen by some defecting Incom engineers.
      • garyfirestorm 14 hours ago
        not sure about the irony - you can't really expect rebels to start their own weapons manufacturing lab right from converting ore into steel... these things are often supplied by a large manufacturer (which is often a monopoly). Why is it any different for a startup to tap into Nvidia's proverbial shovel in order to start digging for gold?
        • tl2do 13 hours ago
          [flagged]
          • adithyassekhar 12 hours ago
            Your reply has a 0% AI score, yet the presence of that "reply to" text is concerning.
            • ta8903 8 hours ago
              Not very concerning because that post is obviously LLM generated.
            • llbbdd 10 hours ago
              I don't know why this would be surprising, because anything purporting to be able to score like this is obviously lying to you.
    • wavemode 7 hours ago
      I had a different issue with the metaphor - shouldn't OpenAI be the *empire*? The death star would be the thing they created, i.e. ChatGPT.
  • umairnadeem123 8 hours ago
    i buy the thesis that audio is a wedge because latency/streaming constraints are brutal, but i wonder if it's also just that evaluation is easier. with vision, it's hard to say if a model is 'right' without human taste, but with speech you can measure wer, speaker similarity, diarization errors, and stream jitter. do you think the real moat is infra (real-time) or data (voices / conversational corpora)?
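    (for what it's worth, the "easier to evaluate" point is very concrete: wer is a one-liner with a package like jiwer. the reference/hypothesis strings below are made up for illustration:)

        # word error rate = (substitutions + deletions + insertions) / reference word count
        import jiwer

        reference  = "book a table for two at seven"
        hypothesis = "book a table for two at eleven"

        print(jiwer.wer(reference, hypothesis))  # 1 substitution over 7 words ~= 0.14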
  • KurSix 4 hours ago
    Most current "voice assistants" still feel like glorified walkie-talkies... you talk, pause awkwardly, they respond, and any interruption breaks the flow
  • gmerc 5 hours ago
    Maybe OpenAI has finally learned that it can't dance at all the parties at once, when every party is progressing towards "commodity".
  • tl2do 15 hours ago
    This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering - anti-aliasing before downsampling, handling spectral leakage, etc.

    Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

    When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.
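    (To make the anti-aliasing point concrete, a toy sketch with scipy: naive decimation vs. resample_poly, which applies a low-pass FIR for you. The tone frequencies are arbitrary:)

        # downsampling 48 kHz -> 16 kHz: naive x[::3] folds everything above 8 kHz
        # back into the band (a 10 kHz tone aliases to 6 kHz); resample_poly filters first
        import numpy as np
        from scipy.signal import resample_poly

        sr_in = 48_000
        t = np.arange(sr_in) / sr_in
        x = np.sin(2 * np.pi * 1_000 * t) + 0.5 * np.sin(2 * np.pi * 10_000 * t)

        bad  = x[::3]                          # aliased
        good = resample_poly(x, up=1, down=3)  # anti-aliased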
    • derf_ 13 hours ago
      Also, while the author complains that there is not a lot of high quality data around [0], you do not *need* a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio data. See, e.g., https://jmvalin.ca/demo/rnnoise/

      [0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active speech determination. It is also less than 6000 conversations totaling less than 1000 hours (x2 speakers, but each is silent over half the time, a fact that can also throw a wrench in some standard algorithms, like volume normalization). It is also English-only.
      • friendzis 4 hours ago
        Wrt data, it's not like there is a shortage of transcribed audio in the form of music sheets, lyrics and subtitles.

        If one asks ~~nice~~ expensive enough they can even get isolated multitracks or teleprompter feeds together with the audiovisual tracks. Heck, if they wanted they could set up dedicated transcription teams for the plethora of podcasts with the costs somewhere in the rounding error range. But you can't siphon that off of torrents and paying for training material goes against the core ethics of the big players.

        Too bad you can't really scrape tiktok/instagram reels with subtitles... Oh no, oh no, oh no no no no
      • tl2do 13 hours ago
        [flagged]
    • isoprophlex 8 hours ago
      I refuse to believe that none of these people ever heard of Nyquist, and that no one was able to come up with "ayyy lmao let's put a low pass on this before downsampling".

      Edit: 2-day-old account posting stuff that doesn't pass the sniff test. Hmmmm... baited by a bot?
      • yalok 7 hours ago
        In ~30 years of my work in the DSP domain, I've seen an insane number of ways to do signal processing wrong, even for the simplest things like passing a buffer and doing resampling.

        The last example I saw was at one large company, done by a developer lacking audio/DSP experience: they used ffmpeg's resampling lib, but after every 10 ms audio frame processed by the resampler they'd invoke flush(), just for the sake of the convenience of having the same number of input and output buffers ... :)
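        (The general pattern, independent of ffmpeg's API: keep the resampler/filter state alive across frames and only flush at end-of-stream. A toy version of the same idea with scipy's lfilter:)

            # streaming a low-pass filter over 10 ms frames: carry the filter state (zi)
            # across calls; re-initializing it per frame is the moral equivalent of
            # flushing every frame and puts a glitch at each frame boundary
            import numpy as np
            from scipy.signal import butter, lfilter, lfilter_zi

            sr = 48_000
            b, a = butter(4, 8_000 / (sr / 2))   # 4th-order low-pass at 8 kHz
            zi = lfilter_zi(b, a) * 0.0          # running filter state, starts at zero

            def process_frame(frame):
                global zi
                out, zi = lfilter(b, a, frame, zi=zi)   # state carried into the next frame
                return out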
        • isoprophlex 6 hours ago
          Haha wow, I guess that gives a very noticeable 100 Hz comb-like effect... no one cared to do a quick sanity check on that output?!
    • nubg 12 hours ago
      AI bot comment
      • troyvit 6 hours ago
        What's the point of saying that without backing it up? Either you think it's so obvious it doesn't need backing up (in which case you don't need to say it), or ...?

        The reason it matters is that soon, any time somebody sees a comment they don't like or think is stupid, they'll just say, "eh a bot said that," and totally dilute the rest of the discussion, even if the comment was real.
        • friendzis 4 hours ago
          We are *quickly* approaching a Tuesday where "bot detected, opinion rejected" is going to be a *default* assumption.
          • Imustaskforhelp 2 hours ago
            It's Monday today, so I guess we just have to wait 24 hours for tomorrow.

            I do feel like AI has made us distrust each other at a scale we've never seen before. Previously there could still be bot-like comments, but they were at least written by a human. Not anymore, and things are getting more sophisticated. I 100% believe it's possible for AI to avoid the "--", the "You are absolutely right", and the other tells I associate with AI writing, so at this point AI might write in such a way that I can't filter out whether it was written by AI or not.

            It's a real psychosis-inducing thought imo, all while AI is inducing job insecurity from top to bottom.
            • gosub100 1 hour ago
              > AI might write in such a way that I can't filter out whether it was written by AI or not.

              This is precisely the point for "farmers" that have no other motivation than to make money. They're not trying to troll us or promote a political view or sell something. They farm accounts and sell them on a secondary market for people who do use them nefariously. An aged, well-upvoted account has value for those groups. So they have every incentive to blend in by parroting back the most popular or neutral talking points.
          • gosub100 1 hour ago
            And it will have second-order effects where real humans reduce their participation because they believe there are only bots, or are tired of being accused of being one.
    • jlehrer1 10 hours ago
      [dead]
    • duped 11 hours ago
      imo audio DSP experts are diametrically opposed to AI on moral grounds. Good luck hiring the good ones. It's like paying doctors to design guns.
      • nextaccountic 9 hours ago
        Not sure your analogy works, Guantanamo had no trouble hiring medical personnel
      • aleph_minus_one 4 hours ago
        > imo audio DSP experts are diametrically opposed to AI on moral grounds.

        Can you elaborate on this point? I don't know the moral grounds of audio DSP experts, and thus I don't understand why in your opinion they wouldn't take an offer if you really pay them some serious amount of money.

        Just to be clear: considering what a typical daily job in DSP programming is like, I can imagine that many audio DSP experts are not the best *culture fit* for AI companies, but this doesn't have anything to do with morality.
  • bossyTeacher 15 hours ago
    Surprised ElevenLabs is not mentioned
    • gorgoiler 9 hours ago
      I was *suspicious* that they are not mentioned, but then I realized this is a VC opinion piece and the first company mentioned joined their portfolio last year.
    • krackers 14 hours ago
      Also 15.ai [1]

      [1] https://en.wikipedia.org/wiki/15.ai
  • mrbluecoat 9 hours ago
    The bigger players probably avoid it because it's a bigger legal liability: https://news.ycombinator.com/item?id=47025864

    ..plenty of money to be made elsewhere
  • amelius 16 hours ago
    Probably because the big companies have their focus elsewhere.
  • SilverElfin 12 hours ago
    Does Wisprflow count as an audio “lab”?
    • carshodev 9 hours ago
      Transcription providers like Wisprflow and Willow Voice are typically providing nice UI/UX around open source models.

      Wisprflow does not create its own models, but I know Willow Voice did do extensive finetuning to improve the quality and speed of their transcription models, so you may count them.
  • RobMurray 11 hours ago
    for a laugh enter nonsense at https://gradium.ai/

    You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.
    • pain_perdu 19 minutes ago
      Hey Rob. I'm not on the tech team here at Gradium (I do GTM) but still curious where you found the glitch? Were you entering words into the STT in the bottom of the front page? Can you share an example so I can replicate? Many thanks!
  • lysace 12 hours ago
    Also: porn.

    Audio is too niche and porn is too ethically messy and legally risky.

    There's also music, which the giants also don't touch. Suno is actually really impressive.
  • throwjjj 8 hours ago
    [dead]
  • lennexz 10 hours ago
    [dead]
  • anvevoice 10 hours ago
    [flagged]