When a caller dials your business, the first 900 milliseconds decide whether they hang up. Older IVRs spend that budget on a greeting and a menu. We spend it picking up.
This post walks through the latency budget Receptic targets for every inbound call — from telephony ringback to the first natural syllable out of the agent — and the engineering trade-offs behind each segment.
The 900ms budget, broken down
Our hard target for time-to-first-audio is under 900ms p95. Here's how it splits:
- 0–80ms — SIP signaling and media setup with Twilio or Telnyx.
- 80–160ms — Agent process wake-up, session warm, customer config load.
- 160–500ms — LLM first-token latency for the opening line.
- 500–820ms — Streaming TTS first chunk into the call.
- 820–900ms — Jitter buffer + safety margin.
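The segment ceilings above can be treated as data and checked per call. Here is a minimal TypeScript sketch; the segment names and the `overBudget` helper are illustrative, not Receptic's actual internals.

```typescript
// Per-segment latency ceilings, summing to the 900ms hard target.
type Segment = { name: string; ceilingMs: number };

const BUDGET: Segment[] = [
  { name: "signal_setup", ceilingMs: 80 },   // SIP signaling + media setup
  { name: "agent_warm", ceilingMs: 80 },     // process wake-up, config load
  { name: "llm_ttft", ceilingMs: 340 },      // LLM first token
  { name: "tts_ttft", ceilingMs: 320 },      // first streaming TTS chunk
  { name: "jitter_margin", ceilingMs: 80 },  // jitter buffer + safety margin
];

const totalBudgetMs = BUDGET.reduce((sum, s) => sum + s.ceilingMs, 0);

// Return the names of any segments that blew their ceiling on this call.
function overBudget(measured: Record<string, number>): string[] {
  return BUDGET.filter((s) => (measured[s.name] ?? 0) > s.ceilingMs)
    .map((s) => s.name);
}
```

Keeping the budget as data makes it trivial to assert in CI that the segments still sum to the published target.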
Warm pools beat cold starts every time
The single biggest latency win came from killing cold starts. We maintain a warm pool of agent processes per region, each pre-loaded with the LLM client and vector index handle, and with a TTS stream already open. When a call lands, the dispatcher hands it to a ready process in under 30ms.
The pool auto-scales on a 30-second window. Idle processes get recycled after 10 minutes to avoid memory drift. We pay for ~15% headroom over peak — a rounding error compared to what a cold start would cost the caller experience.
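The core of the warm-pool idea fits in a few lines. This is a hedged sketch, assuming a hypothetical `AgentProcess` shape; the real dispatcher also handles region affinity and auto-scaling, which are omitted here.

```typescript
// Minimal warm-pool sketch: pre-warmed processes are pushed in by a
// background scaler; the call path only ever does a constant-time pop.
interface AgentProcess {
  id: string;
  lastUsedAt: number; // epoch ms
}

class WarmPool {
  private ready: AgentProcess[] = [];

  // Called by the scaler after a process finishes warming.
  add(proc: AgentProcess): void {
    this.ready.push(proc);
  }

  // Called on the hot path when a call lands. No cold start here.
  acquire(): AgentProcess | undefined {
    return this.ready.shift();
  }

  // Recycle processes idle beyond the cutoff to avoid memory drift.
  reap(nowMs: number, maxIdleMs = 10 * 60 * 1000): number {
    const before = this.ready.length;
    this.ready = this.ready.filter((p) => nowMs - p.lastUsedAt <= maxIdleMs);
    return before - this.ready.length;
  }

  get size(): number {
    return this.ready.length;
  }
}
```

The design choice worth noting: all expensive work happens in `add`, before the process is visible to the dispatcher, so `acquire` stays off the latency budget entirely.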
Streaming TTS, not buffered TTS
The naive approach: wait for the LLM to finish, then synthesize, then play. That's 2-3 seconds of dead air for any non-trivial response. Instead we stream tokens out of the LLM directly into a streaming TTS endpoint, and pipe the resulting audio chunks into the call as they arrive.
// Feed tokens into TTS and drain audio concurrently --
// sequential loops would buffer everything before any audio left.
const feed = (async () => {
  for await (const token of llmStream(prompt)) ttsStream.write(token);
  ttsStream.end();
})();
for await (const audioChunk of ttsStream) {
  rtpSession.send(audioChunk);
}
await feed;

The first audio chunk leaves the agent before the LLM has even finished thinking. Callers hear the agent start speaking while the next sentence is still being generated.
Backchannel sounds buy you time
When the agent needs to make a tool call (calendar lookup, KB search, CRM write) it can't respond immediately. Instead of dead air, we play a soft backchannel sound — “mm-hmm”, “okay”, a thoughtful pause — that buys 300-600ms of natural-sounding processing time.
The trick isn't making the agent faster. It's making the delay feel human.
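One way to wire this up is a grace-period wrapper around the tool call: if the result arrives quickly, no filler plays at all; if it drags, the backchannel fires once to cover the gap. A sketch, assuming hypothetical `toolCall` and `playBackchannel` callbacks rather than any real Receptic API:

```typescript
// Run a tool call; if it is still pending after graceMs, trigger a
// backchannel filler ("mm-hmm", "okay") instead of leaving dead air.
async function withBackchannel<T>(
  toolCall: () => Promise<T>,
  playBackchannel: () => void,
  graceMs = 300,
): Promise<T> {
  const pending = toolCall();
  // Arm the filler; fast tool calls cancel it before it ever fires.
  const timer = setTimeout(playBackchannel, graceMs);
  try {
    return await pending;
  } finally {
    clearTimeout(timer);
  }
}
```

The grace period matters: playing a filler before every tool call would make fast lookups feel slower than they are.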
What we measure
Every call writes a latency trace with these spans: signal_setup, agent_warm, llm_ttft, tts_ttft, first_audio_out. We page on p95 regression of any single span by more than 15% week-over-week.
Right now we're at 720ms p50 / 880ms p95 for first audio, on calls that don't require a tool call. With a tool call in the opening turn, p95 drifts to 1.4s — and that's where backchannels do the heavy lifting.
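The paging rule reduces to a per-span comparison of weekly p95s. A minimal sketch, assuming p95 aggregates already exist per week; `regressedSpans` is an illustrative name, not our actual alerting code:

```typescript
// The five spans every call trace records.
const SPANS = [
  "signal_setup",
  "agent_warm",
  "llm_ttft",
  "tts_ttft",
  "first_audio_out",
] as const;

// Return the spans whose p95 regressed by more than `threshold`
// (15% by default) week-over-week. Any non-empty result pages.
function regressedSpans(
  lastWeekP95: Record<string, number>,
  thisWeekP95: Record<string, number>,
  threshold = 0.15,
): string[] {
  return SPANS.filter((s) => thisWeekP95[s] > lastWeekP95[s] * (1 + threshold));
}
```

Paging on individual spans rather than end-to-end p95 localizes the regression before anyone opens a trace.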
What's next
We're testing speculative speech: starting TTS on the most likely opening phrase before the LLM commits. Early data shows ~120ms shaved off p50 with no quality regression. If it holds at scale, sub-700ms p95 is in reach.
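The commit check for speculative speech can be as simple as a prefix match: keep the speculative audio only if the LLM's committed text starts with the phrase we already began synthesizing. This is a simplified, hypothetical sketch of that check, not our production matcher:

```typescript
// Decide whether speculative TTS audio can be kept once the LLM
// commits its actual opening. On mismatch, discard and resynthesize.
function shouldKeepSpeculation(predicted: string, actual: string): boolean {
  return actual
    .trim()
    .toLowerCase()
    .startsWith(predicted.trim().toLowerCase());
}
```

A mismatch costs one discarded audio buffer; a match saves the full TTS time-to-first-chunk on the opening phrase.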
See it answer a real call.
Spin up an agent on a sandbox number in minutes. No credit card required to test.