
Every organization building voice agents draws on a similar set of widely discussed practices: optimizing latency, tuning VAD (Voice Activity Detection) thresholds, and designing for streaming. This is necessary groundwork, but it is also where most of the available literature and public writing stops.
The agents that continue to feel inconsistent after this stage are not failing on fundamentals. They are limited by deeper structural decisions: how the system behaves under uncertainty, how it handles incorrect inputs, and how it manages state over time. These issues do not appear in benchmarks, but they define production performance.
What follows are five principles drawn from building production-grade voice agents that operate reliably in real-world conditions.
1. Filler phrases are not just a UX (User Experience) decision. They are a latency engineering technique.
Filler phrases such as “let me check that” or “give me a moment” are often introduced to make responses sound more natural. Their often overlooked value, however, is technical: they mask latency. If the time between receiving user input and starting audio output is around 500 milliseconds, an immediately triggered filler phrase can remove the perception of delay. The user experiences continuous output even though the underlying latency remains unchanged.
Effective systems treat filler phrases as part of the latency budget. They are triggered only when delays exceed a threshold, and their duration is matched to the expected wait time. In some cases, non-verbal cues such as subtle processing sounds are used to indicate activity. More advanced implementations align filler responses with the type of query, rather than relying on generic phrases.
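The threshold-based approach above can be sketched with asyncio: race the real reply against the latency budget, and only play a filler if the budget is blown. The threshold value, phrase table, and function names here are illustrative, not tuned production values.

```python
import asyncio

# Hypothetical filler gate: a filler plays only when the real reply
# misses its latency budget, and the phrase is matched to the intent.
FILLER_THRESHOLD_S = 0.3          # assumed budget before a filler is justified
FILLERS_BY_INTENT = {             # illustrative, intent-matched phrases
    "lookup": "Let me check that for you.",
    "booking": "Give me a moment while I set that up.",
    "default": "One moment.",
}

async def respond(generate_reply, intent: str, play_audio):
    """Race the real reply against the filler threshold."""
    reply_task = asyncio.ensure_future(generate_reply())
    done, _ = await asyncio.wait({reply_task}, timeout=FILLER_THRESHOLD_S)
    if not done:
        # Budget exceeded: mask the remaining wait with an intent-matched filler.
        filler = FILLERS_BY_INTENT.get(intent, FILLERS_BY_INTENT["default"])
        await play_audio(filler)
    await play_audio(await reply_task)
```

A fast reply skips the filler entirely; a slow one gets a filler whose length roughly covers the expected remainder of the wait.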
2. STT confidence scores will actively mislead you
Speech-to-text (STT) systems provide confidence scores with transcripts, and low-confidence outputs are usually treated with caution. The bigger issue is errors that come with high confidence.
Proper nouns, domain-specific terms, and numbers are often transcribed incorrectly even when the system is highly confident. When the model encounters unfamiliar terms, it replaces them with similar-sounding known words.
These errors are treated as correct by downstream systems, leading to responses that are coherent but based on incorrect input, with no clear signal of failure.
A practical solution is to introduce a domain vocabulary layer between transcription and reasoning. This involves maintaining a list of relevant terms and applying fuzzy matching to correct likely mistakes before passing the input forward. When there is uncertainty, it is more reliable to confirm with the user rather than proceed. Asking the user to verify a key term adds a small interaction step but prevents more complex downstream errors.
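A minimal version of that vocabulary layer can be built with standard-library fuzzy matching. The domain terms and the two similarity cutoffs below are illustrative: above the high cutoff the correction is applied silently, in the gray zone between the cutoffs the term is flagged for user confirmation.

```python
import difflib

# Illustrative domain vocabulary; a real deployment would load this
# from its own product catalog or knowledge base.
DOMAIN_TERMS = ["Zyrtec", "Metformin", "Lisinopril", "Atorvastatin"]
TERM_BY_LOWER = {t.lower(): t for t in DOMAIN_TERMS}
HIGH, LOW = 0.85, 0.6   # assumed cutoffs: auto-correct vs. ask the user

def correct_transcript(words):
    """Return (corrected words, terms that need user confirmation)."""
    corrected, uncertain = [], []
    for word in words:
        match = difflib.get_close_matches(
            word.lower(), list(TERM_BY_LOWER), n=1, cutoff=LOW)
        if not match:
            corrected.append(word)          # nothing close in the vocabulary
            continue
        canonical = TERM_BY_LOWER[match[0]]
        score = difflib.SequenceMatcher(None, word.lower(), match[0]).ratio()
        if score >= HIGH:
            corrected.append(canonical)     # confident fuzzy correction
        else:
            corrected.append(word)          # keep as-is, but confirm with user
            uncertain.append((word, canonical))
    return corrected, uncertain
```

Anything returned in `uncertain` is the signal to ask the user a quick verification question rather than pass a guess downstream.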
3. Your LLM does not know it is in a voice conversation and your TTS (Text to Speech) will expose it
Large language models (LLMs) generate text without understanding how it will be spoken, which leads to two issues. The first is structure. Even after removing formatting like lists or markdown, responses are often too long and complex. This works for reading but not for listening. Voice responses require shorter sentences, with one idea at a time and minimal complexity. This needs explicit prompting rules and validation through listening, not just reading.
The second issue is pronunciation. Outputs such as “Dr. Chen,” “$4.5M,” or “SQL” give the TTS engine no guidance on how they should be spoken, creating inconsistencies in speech output.
This is addressed through a grapheme-to-phoneme layer within the text-to-speech system, which converts text into spoken form. Strong systems handle abbreviations, numbers, and domain-specific terms without manual fixes, ensuring that the output is clear and accurate.
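Even without a full grapheme-to-phoneme layer, a normalization pass before TTS catches the common cases. This is a sketch, not a complete normalizer; the lexicon entries and the chosen spoken forms (e.g. “sequel” for SQL) are assumptions that depend on house style.

```python
import re

# Illustrative written-form -> spoken-form lexicon.
SPOKEN_LEXICON = {
    "Dr.": "Doctor",
    "SQL": "sequel",   # some teams prefer "S Q L"
    "API": "A P I",
}

def normalize_for_tts(text: str) -> str:
    # Expand "$4.5M"-style amounts into pronounceable words.
    text = re.sub(
        r"\$(\d+(?:\.\d+)?)M\b",
        lambda m: f"{m.group(1)} million dollars",
        text,
    )
    # Replace lexicon entries only when they appear as whole tokens.
    for written, spoken in SPOKEN_LEXICON.items():
        text = re.sub(rf"(?<!\w){re.escape(written)}(?!\w)", spoken, text)
    return text
```

Running every response through a pass like this keeps pronunciation consistent across turns instead of leaving it to whatever the TTS engine guesses.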
4. Long context does not make your agent smarter. It makes it slower and less reliable.
Context growth creates two problems. The first is prompt bloat: the base prompt expands over time as more instructions are added for persona, fallback logic, tools, and edge cases, and as it grows, the model becomes less consistent in following instructions.
The second is noise: voice transcripts include filler words, interruptions, and irrelevant speech, so over multiple turns the useful signal declines, affecting both accuracy and efficiency.
Effective systems treat context as something to manage. Base prompts are kept minimal, with critical instructions placed at the top. Older interactions are summarized instead of stored in full. Confirmed information is stored separately in a structured state, while only recent interactions are used for reasoning. Irrelevant content is removed.
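The separation described above — structured state for confirmed facts, a summary for old turns, verbatim text only for recent ones — can be sketched as a small container. The turn window and the truncation-based summarizer are placeholders; a real system would generate summaries with a model.

```python
from dataclasses import dataclass, field

MAX_RECENT_TURNS = 6   # assumed window of verbatim turns

@dataclass
class ConversationContext:
    state: dict = field(default_factory=dict)   # confirmed facts ("slots")
    summary: str = ""                           # compressed older history
    recent: list = field(default_factory=list)  # verbatim recent turns

    def add_turn(self, role: str, text: str):
        self.recent.append((role, text))
        while len(self.recent) > MAX_RECENT_TURNS:
            old_role, old_text = self.recent.pop(0)
            # Placeholder summarizer: a real system would use a
            # model-generated summary, not truncation.
            self.summary += f" {old_role}: {old_text[:40]}"

    def confirm(self, key: str, value: str):
        # Confirmed facts survive even after their turns are summarized.
        self.state[key] = value

    def build_prompt(self, base_prompt: str) -> str:
        parts = [base_prompt]
        if self.state:
            parts.append("Confirmed: " +
                         ", ".join(f"{k}={v}" for k, v in self.state.items()))
        if self.summary:
            parts.append("Earlier:" + self.summary)
        parts += [f"{role}: {text}" for role, text in self.recent]
        return "\n".join(parts)
```

The point of the structure is that the prompt size stays roughly constant per turn, while confirmed facts can never be squeezed out by conversation noise.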
5. Speculative execution is underused and misunderstood
Most voice pipelines operate sequentially, moving from input to transcription, then reasoning, and finally response generation. This introduces avoidable latency. In many cases, user intent can be inferred early in an utterance. Phrases such as “can you book a…” or “what’s the…” provide strong signals about the required action before the user has finished speaking.
Speculative execution addresses this by initiating backend operations in parallel with ongoing processing. API (Application Programming Interface) calls, database queries, or retrieval steps are triggered based on early intent signals rather than waiting for full confirmation. If correct, the data is ready when needed, reducing response time. If incorrect, the result is discarded with minimal impact. This requires additional components, including a lightweight intent classifier and clear separation between speculative and final execution paths. When implemented correctly, it is highly effective in structured, high-frequency workflows.
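A stripped-down version of this pattern looks like the following: a trivial prefix classifier stands in for the lightweight intent classifier, and the speculative task is kept clearly separate from the final path so a wrong guess can be discarded. The prefix table and function names are illustrative.

```python
import asyncio

# Illustrative early-intent signals; a real classifier would be a
# small model over the partial transcript, not string prefixes.
INTENT_PREFIXES = {
    "can you book": "booking",
    "what's the": "lookup",
}

def early_intent(partial_utterance: str):
    text = partial_utterance.lower()
    for prefix, intent in INTENT_PREFIXES.items():
        if text.startswith(prefix):
            return intent
    return None

async def handle_utterance(partial, final_intent, prefetch):
    guess = early_intent(partial)
    # Fire the backend call speculatively, before the user finishes.
    task = asyncio.ensure_future(prefetch(guess)) if guess else None
    await asyncio.sleep(0)   # stand-in for transcription/reasoning finishing
    if task and guess == final_intent:
        return await task            # speculation paid off: data is ready
    if task:
        task.cancel()                # wrong guess: discard at minimal cost
    return await prefetch(final_intent)  # fall back to the sequential path
```

The key design choice is that the speculative result is only ever consumed after the final intent confirms the guess; everything else is cancelled, so a misprediction costs one wasted backend call and nothing user-visible.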
The real answer is better pipelines
Many of these issues will improve as models evolve. However, they are fundamentally system design challenges involving orchestration, state management, input handling, and latency control. Reliable voice agents are built through disciplined pipeline design, not just model improvements.