The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet, but a 3s latency waiting on a reply seems rage-inducing if you're actually trying to accomplish something.<p>Is that really where SOTA is right now?
I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations.<p>I attended VAPI Con earlier this year, and a lot of the discussion centered on how interruptions and turn detection are the next frontier in making voice agents smoother conversationalists. Knowing when to speak is a hard problem even for humans, but when you listen to a lot of voice agent calls, the friction point right now tends to be either interrupting too often or waiting too long to respond.<p>The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.
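For a concrete sense of why this is hard: the naive approach is just "end the turn after N ms of silence," and that single knob is exactly the trade-off described above. A minimal sketch, assuming 16 kHz 16-bit mono PCM frames and the webrtcvad package (the silence_ms value is the whole problem: too low and the agent barges in during a pause, too high and every reply feels laggy):

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples per frame

class SilenceEndpointer:
    """Declare the user's turn over after a fixed run of non-speech frames."""
    def __init__(self, silence_ms=700, aggressiveness=2):
        self.vad = webrtcvad.Vad(aggressiveness)
        self.needed = silence_ms // FRAME_MS  # consecutive silent frames required
        self.silent = 0

    def feed(self, frame: bytes) -> bool:
        """Feed one 30 ms frame; return True when the turn is considered finished."""
        if self.vad.is_speech(frame, SAMPLE_RATE):
            self.silent = 0
            return False
        self.silent += 1
        return self.silent >= self.needed
```

Semantic turn-detection models like the one Deepgram announced are essentially an attempt to replace that fixed timeout with a prediction of whether the speaker is actually done.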
I wonder if it's possible to do the Apple trick of hiding latency using animations. The audio equivalent could be the chime that Siri plays after receiving a request.
I think interruptions had better be the top priority. I find text LLMs rage-inducing with their BS verbiage that takes multiple prompts to reduce, and they still break promises like "one sentence only" by just dropping the punctuation. I can't imagine a world where I have to listen to one of these things.
From my personal experience building a few AI IVR demos with Asterisk in early 2025, testing STT/TTS/inference products from a handful of different vendors, a reliable maximum latency of 2-3 seconds sounds like a definite improvement. Just a year ago I was seeing 3 to 8 seconds even for short inputs producing short outputs. Half of that is of course over-committed resources, but clearly the execution performance of these models is improving.
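For anyone doing similar testing, a rough sketch of how I'd break down where those seconds go per turn; transcribe, complete, and synthesize here are placeholders for whatever vendor SDK calls you're evaluating, the point is just per-stage wall-clock timing:

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), print that stage's wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

def handle_turn(audio_in, transcribe, complete, synthesize):
    """One IVR turn: STT -> LLM -> TTS, timed stage by stage."""
    text = timed("STT", transcribe, audio_in)
    reply = timed("LLM", complete, text)
    return timed("TTS", synthesize, reply)
```

Timing the stages separately makes it easier to see whether the slow leg is actually the model or just an over-committed vendor endpoint.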
Been experimenting with having a local Home Assistant agent include a Qwen 0.5B model to provide a quick response indicating that the agent is "thinking" about the request. It seems to work OK for the use case, but it feels like it'd get really repetitive in a two-way conversation. Another way to handle this would be to have the small model provide the first 3-5 words of a (non-committal) response and feed that in as part of the prompt to the larger model.
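Rough sketch of that second idea, assuming small_llm() and big_llm() are callables wrapping the local 0.5B model and the larger model, and speak() hands text to TTS:

```python
from concurrent.futures import ThreadPoolExecutor

def respond(user_text, small_llm, big_llm, speak):
    # 1. The small local model produces a short, non-committal opener quickly.
    opener = small_llm(
        f"Reply with a 3-5 word non-committal opener to: {user_text}"
    )
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 2. The big model continues from that opener in the background.
        future = pool.submit(
            big_llm,
            f"User said: {user_text}\n"
            f'You already began your reply with: "{opener}"\n'
            "Continue from there without repeating the opener."
        )
        # 3. Speaking the opener covers the big model's latency.
        speak(opener)
        speak(future.result())
```

The prompt-prefix trick is what keeps the opener from sounding bolted on, since the larger model's answer has to read as a continuation of it.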
Absolutely not.<p>500-1000ms is borderline acceptable.<p>Sub-300ms is closer to SOTA.<p>2000ms or more means people will hang up.
160ms is essentially optimal, and AFAIK you can get down to about 200ms in practice.
play "Just a second, one moment please <sounds of typing>".wave as soon as input goes quiet.<p>ChatGPT app has a audio version of the spinner icon when you ask it a question and it needs a second before answering.
Just try Gemini Live on your phone. That's state of the art
Microsoft Foundry's realtime voice API (which itself wraps AI models from the major players) has response times measured in milliseconds.
No, there are models with sub-second latency for sure
Perhaps you didn't read that these are "production-ready golden baselines validated for enterprise deployment."<p>How does their golden nature not assuage these concerns for you?
Sesame was the fastest model for a bit. Not sure what that team is doing anymore; they kind of went radio silent.<p><a href="https://app.sesame.com/" rel="nofollow">https://app.sesame.com/</a>