The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet, but a 3s latency waiting on a reply seems rage-inducing if you're actually trying to accomplish something.<p>Is that really where SOTA is right now?
I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations.<p>I attended VAPI Con earlier this year, and a lot of the discussion centered on how interruptions and turn detection are the next frontier in making voice agents smoother conversationalists. Knowing when to speak is a hard problem even for humans, but when you listen to a lot of voice agent calls, the friction point right now tends to be either interrupting too often or waiting too long to respond.<p>The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.
I think interruptions had better be the top priority. I find text LLMs rage-inducing with their BS verbiage that takes multiple prompts to cut down, and they still break promises like "keep it to one sentence" by just dropping punctuation. I can't imagine a world where I have to listen to one of these things.
I wonder if it's possible to do the Apple trick of hiding latency with animations. The audio equivalent could be the chime Siri plays after receiving a request.
Sesame was the fastest model for a bit. Not sure what that team is doing anymore; they kind of went radio silent.<p><a href="https://app.sesame.com/" rel="nofollow">https://app.sesame.com/</a>
Absolutely not.<p>500-1000ms is borderline acceptable.<p>Sub-300ms is closer to SOTA.<p>2000ms or more means people will hang up.
play "Just a second, one moment please <sounds of typing>".wave as soon as input goes quiet.<p>ChatGPT app has a audio version of the spinner icon when you ask it a question and it needs a second before answering.
160ms is essentially optimal, and you can get down to about 200ms with current systems AFAIK.
Been experimenting with having a local Home Assistant agent include a Qwen 0.5B model to provide a quick response indicating that the agent is "thinking" about the request. It seems to work OK for the use case, but it feels like it'd get really repetitive in a two-way conversation. Another way to handle this would be to have the small model produce the first 3-5 words of a (non-committal) response and feed that in as part of the prompt to the larger model.
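Rough sketch of that second idea, assuming both models sit behind a local OpenAI-compatible endpoint; the URL, the model tags, and the trick of feeding the opener back as a partial assistant turn are all placeholders/assumptions, not anything Home Assistant ships:<p><pre><code>from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # placeholder endpoint

def quick_ack_then_answer(user_text):
    # Tiny model fires back a 3-5 word, non-committal opener almost instantly.
    ack = client.chat.completions.create(
        model="qwen2.5:0.5b",  # placeholder tag for the small model
        messages=[{"role": "user",
                   "content": f"Give a short, non-committal 3-5 word spoken opener for: {user_text}"}],
        max_tokens=12,
    ).choices[0].message.content

    # Hand the opener to the big model as the start of its own turn so the
    # full reply (ideally) continues from what the user already heard.
    answer = client.chat.completions.create(
        model="qwen2.5:32b",   # placeholder tag for the large model
        messages=[{"role": "user", "content": user_text},
                  {"role": "assistant", "content": ack}],
    ).choices[0].message.content

    return ack, answer  # speak ack right away, then answer when it arrives</code></pre>Whether the large model actually continues from that trailing assistant message depends on the backend; stuffing the opener into the system prompt is the boring fallback.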
From my personal experience building a few AI IVR demos with Asterisk in early 2025 and testing STT/TTS/inference products from a handful of vendors, a reliable maximum latency of 2-3 seconds sounds like a definite improvement. Just a year ago I saw times of 3 to 8 seconds even for short inputs producing short outputs. Half of that is of course over-committed resources, but the execution performance of these models is clearly improving.
Somewhat randomly, I happened to experiment with exactly that a few months ago, based on a Japanese advent calendar project[1] - the code is all over the place and only works with Japanese speech, but the gist is as follows (also in [2]; a rough sketch is at the end of this comment).<p>The trick is NOT to wait for the LLM to finish talking, but:<p><pre><code> 1 at end of user VAD, call the LLM and stream the response into a buffer (simple enough)
2 chunk the response at [commas, periods, newlines], and queue sentence-oid texts
3 pipe queued sentence-oid fragments into a fast classical TTS and queue audio snippets
4 play queued sentence-oid-audio-snippets, maintaining correspondence of consumed audio and text
5 at user VAD, stop and clear everything everywhere, undoing queued unplayed voice, nuking unplayed text from chat log
6 at end of user VAD, feed the amended transcripts that are canonical to user's ears to step 1
7 (make sure to parallelize it all)
</code></pre>
This flow (hypothetically) allows interactions like:<p><pre><code> user: "what's the date today"
sys: "[today][is thursday], [decem"
user: "sorry yesterday"
sys: "[...uh,][wednesday?][Usually?]"
1: https://developers.cyberagent.co.jp/blog/archives/44592/
2: https://gist.github.com/numpad0/18ae612675688eeccd3af5eabcfdf686</code></pre>
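Sketching steps 1-4 in minimal Python, with llm_stream/tts/play as hypothetical stand-ins and the interruption handling of steps 5-7 left out (the real, messier code is in [2]):<p><pre><code>import asyncio, re

# Chunk boundaries: commas, periods, newlines (Japanese and ASCII variants).
SPLIT = re.compile(r"(?<=[、。，,.!?\n])")

async def speak_streaming(llm_stream, tts, play, transcript):
    """Steps 1-4: buffer streamed LLM tokens, cut at punctuation,
    TTS each fragment, play in order, keep text/audio correspondence."""
    text_q, audio_q = asyncio.Queue(), asyncio.Queue()

    async def chunker():                        # steps 1-2
        buf = ""
        async for token in llm_stream:
            buf += token
            parts = SPLIT.split(buf)
            buf = parts.pop()                   # keep the unfinished tail
            for p in parts:
                await text_q.put(p)             # queue sentence-oid texts
        if buf:
            await text_q.put(buf)
        await text_q.put(None)                  # end-of-stream sentinel

    async def synthesizer():                    # step 3
        while True:
            frag = await text_q.get()
            if frag is None:
                await audio_q.put(None)
                break
            await audio_q.put((frag, await tts(frag)))

    async def player():                         # step 4
        while True:
            item = await audio_q.get()
            if item is None:
                break
            frag, wav = item
            await play(wav)
            transcript.append(frag)             # only *played* text becomes canonical

    await asyncio.gather(chunker(), synthesizer(), player())

# Tiny demo with fake stand-ins:
async def _demo():
    async def fake_llm():
        for t in ["Today ", "is ", "Thursday, ", "Decem", "ber ", "4th."]:
            yield t
    async def fake_tts(text): return f"wav({text})"
    async def fake_play(wav): print("playing", wav)
    spoken = []
    await speak_streaming(fake_llm(), fake_tts, fake_play, spoken)
    print("canonical transcript:", spoken)

asyncio.run(_demo())</code></pre>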
Microsoft Foundry's realtime voice API (which itself is wrapping AI models from the major players) has response times in the milliseconds.
No, there are models with sub-second latency for sure.
Just try Gemini Live on your phone. That's state of the art.
Perhaps you didn't read that these are "production-ready golden baselines validated for enterprise deployment."<p>How does their golden nature not dispel these concerns for you?