When we started building real-time intent scoring for live calls, the first question everyone asked was: why does latency matter if the call is 3 minutes long? The answer is that real-time scoring isn't useful for post-call reporting — it's useful for routing decisions, escalation triggers, and dynamic script changes that happen mid-call. At 300ms, you're routing 5 seconds late. At 60ms, you're routing before the caller finishes their first sentence. That's the difference between a system that reacts and one that leads.
Why TTFT is the wrong metric
Time-to-first-token (TTFT) is the standard latency metric in LLM systems — it measures how long until the model starts generating output. For our use case, it's not just incomplete; it's misleading. We don't need the model to start generating — we need it to produce a useful signal that our routing layer can act on.
The metric we optimized for is TTFS: time-to-first-signal — specifically, the time from start of audio input to a routing-actionable intent score. This is a stricter requirement than TTFT in one dimension (we need a complete score, not partial output) and a looser requirement in another (we don't need the full transcript, just the intent classification).
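As a minimal sketch of how TTFS could be measured, assume the classifier emits a stream of timestamped confidence scores, one per processed chunk, ordered by time. The `ScoreEvent` type, the `time_to_first_signal` helper, and the sample values are hypothetical and used only for illustration; they are not taken from the production system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScoreEvent:
    t_ms: float        # time since start of audio input, in milliseconds
    confidence: float  # classifier confidence for the top intent


def time_to_first_signal(events: Iterable[ScoreEvent], threshold: float) -> Optional[float]:
    """Return the time of the first score at or above the threshold.

    Assumes events arrive in time order. Returns None if no score
    ever becomes routing-actionable.
    """
    for ev in events:
        if ev.confidence >= threshold:
            return ev.t_ms
    return None


# Hypothetical scores for one call, one event per 20ms chunk.
events = [
    ScoreEvent(20.0, 0.41),
    ScoreEvent(40.0, 0.66),
    ScoreEvent(60.0, 0.83),
    ScoreEvent(80.0, 0.91),
]

ttfs = time_to_first_signal(events, threshold=0.80)
print(ttfs)
```

Unlike TTFT, this clock stops only when a score crosses the routing threshold, so partial or low-confidence output does not count.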
Architecture: how we got to 60ms
The 60ms target was non-negotiable for our use case: it is the latency ceiling under which real-time call routing can happen before a human caller registers any interaction with the system. Getting there required decisions at four levels of the stack:
- Audio chunking. We process in 20ms non-overlapping chunks at 16kHz mono. Most production speech pipelines use 100–200ms chunks because they're optimized for transcription accuracy. At 20ms, we sacrifice some transcription quality in exchange for signal readiness 5–10× earlier. For intent classification (not verbatim transcription), this tradeoff is favorable.
- Feature extraction. Rather than running audio through a full transcription pipeline before classification, we extract acoustic and prosodic features (pitch, energy, speaking rate, pause patterns) in parallel with partial transcription. These features carry significant intent signal even before words are fully recognized.
- Model pruning and quantization. Our production intent classifier is a pruned version of a larger model, running at INT8 precision. Full-precision inference took 35ms per chunk — too slow for our target. Quantized inference runs in 8–12ms per chunk, with less than 2% accuracy degradation on our validation set.
- Edge inference. The model runs on GPU-backed edge nodes co-located with our carrier interconnects. A round trip to a central inference cluster adds 12–25ms, and running at the edge eliminates it. This was one of the largest latency reductions in our architecture: an 18ms improvement from moving inference closer to the source.
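The chunking step above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: audio is assumed to be an in-memory list of integer PCM samples, and `chunk_audio` is a hypothetical helper name. The chunk size follows from the figures in the text (20ms at 16kHz mono).

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000  # 16kHz mono, as described above
CHUNK_MS = 20            # chunk duration from the text
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 320 samples


def chunk_audio(samples: List[int], size: int = SAMPLES_PER_CHUNK) -> Iterator[List[int]]:
    """Yield consecutive non-overlapping chunks.

    A trailing partial chunk is dropped in this sketch; a streaming
    system would buffer it until more audio arrives.
    """
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


# Hypothetical input: one second of silence.
one_second = [0] * SAMPLE_RATE_HZ
chunks = list(chunk_audio(one_second))
print(len(chunks), len(chunks[0]))

# The first 20ms chunk is complete 5x to 10x sooner than a 100-200ms chunk.
speedup_range = (100 // CHUNK_MS, 200 // CHUNK_MS)
print(speedup_range)
```

The ratio at the end is where the "5–10× earlier" figure comes from: the first chunk is available after 20ms instead of 100–200ms.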
False positives: the tradeoff you're actually managing
| Confidence threshold | False positive rate | TTFS (p50) | Missed signals |
|---|---|---|---|
| 0.70 | 12.4% | 38ms | 3.2% |
| 0.80 | 4.1% | 58ms | 7.8% |
| 0.90 | 1.2% | 94ms | 18.3% |
| 0.95 | 0.4% | 142ms | 31.5% |
We set 0.80 as our production threshold. At 0.70, the false positive rate causes too many incorrect routing decisions: sending an agent a caller who doesn't match the intent signal is worse than delivering a correct signal slightly later. At 0.90 and above, latency climbs past the useful window for live routing decisions, and missed signals leave too many callers unrouted. The 0.80 setting delivers roughly 4% false positives (4.1%) and a sub-60ms median TTFS (58ms). It's the right tradeoff for production-scale call centers.
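The selection logic can be expressed as a small constrained choice over the table above. This is a sketch only: the `OperatingPoint` type and `pick_threshold` helper are hypothetical, and the 5% false positive cap is an assumed constraint chosen for illustration, not a figure from our production configuration. The 60ms TTFS budget and the table rows come from the text.

```python
from typing import List, NamedTuple, Optional


class OperatingPoint(NamedTuple):
    threshold: float
    false_positive_pct: float
    ttfs_p50_ms: float
    missed_pct: float


# Rows copied from the table above.
POINTS = [
    OperatingPoint(0.70, 12.4, 38, 3.2),
    OperatingPoint(0.80, 4.1, 58, 7.8),
    OperatingPoint(0.90, 1.2, 94, 18.3),
    OperatingPoint(0.95, 0.4, 142, 31.5),
]


def pick_threshold(
    points: List[OperatingPoint],
    max_ttfs_ms: float,
    max_fp_pct: float,
) -> Optional[OperatingPoint]:
    """Among points meeting both limits, return the one with the fewest missed signals."""
    candidates = [
        p for p in points
        if p.ttfs_p50_ms <= max_ttfs_ms and p.false_positive_pct <= max_fp_pct
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.missed_pct)


# 60ms budget from the text; the 5% false positive cap is an assumption.
choice = pick_threshold(POINTS, max_ttfs_ms=60, max_fp_pct=5.0)
print(choice)
```

Under those two limits, 0.70 fails the false positive cap and 0.90 and 0.95 exceed the latency budget, which leaves 0.80 as the only feasible row.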