Intent scoring at 60ms: an engineering teardown

When we started building real-time intent scoring for live calls, the first question everyone asked was: why does latency matter if the call is 3 minutes long? The answer is that real-time scoring isn't useful for post-call reporting — it's useful for routing decisions, escalation triggers, and dynamic script changes that happen mid-call. At 300ms, you're routing 5 seconds late. At 60ms, you're routing before the caller finishes their first sentence. That's the difference between a system that reacts and one that leads.

Audio stream → 20ms chunks → Feature extraction → Edge inference → Intent signal <60ms

Why TTFT is the wrong metric

Time-to-first-token (TTFT) is the standard latency metric in LLM systems — it measures how long until the model starts generating output. For our use case, it's not just incomplete; it's misleading. We don't need the model to start generating — we need it to produce a useful signal that our routing layer can act on.

The metric we optimized for is TTFS: time-to-first-signal — specifically, the time from start of audio input to a routing-actionable intent score. This is a stricter requirement than TTFT in one dimension (we need a complete score, not partial output) and a looser requirement in another (we don't need the full transcript, just the intent classification).

TTFT measures generation start. Useful for conversational AI where the caller expects a response. Not useful for routing decisions where what matters is the routing event, not a generated sentence.

TTFS measures signal readiness. The threshold: a confidence score of ≥0.80 on any intent class. Below 0.80, we classify as "uncertain" and continue collecting audio rather than routing prematurely.

The gap matters. In our architecture, TTFS typically runs 2–3× faster than TTFT because we route on a classification signal, not a generated response. Optimizing for TTFT would have delivered a slower routing system.
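The readiness rule above can be sketched in a few lines. This is a minimal illustration, not our production code: the `Signal` type, the `first_signal` function, and the intent names in the usage example are hypothetical, and the sketch counts only the audio consumed before the threshold is crossed, so it ignores feature extraction and inference time.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

CONFIDENCE_THRESHOLD = 0.80  # production threshold described in this article
CHUNK_MS = 20                # chunk duration described in this article


@dataclass
class Signal:
    intent: str
    confidence: float
    audio_ms: int  # audio consumed before the signal was ready (compute time excluded)


def first_signal(scores_per_chunk: Iterable[Dict[str, float]],
                 threshold: float = CONFIDENCE_THRESHOLD) -> Optional[Signal]:
    """Return the first routing-actionable signal, or None while still "uncertain"."""
    for i, scores in enumerate(scores_per_chunk, start=1):
        if not scores:
            continue  # no classes scored for this chunk; keep collecting audio
        intent, confidence = max(scores.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            return Signal(intent, confidence, i * CHUNK_MS)
    return None


# Hypothetical per-chunk scores: confidence crosses 0.80 on the third chunk.
stream = [
    {"billing": 0.41, "cancel": 0.35},
    {"billing": 0.62, "cancel": 0.30},
    {"billing": 0.83, "cancel": 0.12},
]
signal = first_signal(stream)
```

Here `signal` carries the `billing` intent after 60ms of audio. A stream that never reaches 0.80 returns `None`, which corresponds to the "uncertain" state in which the system keeps listening instead of routing.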

Architecture: how we got to 60ms

The 60ms target was non-negotiable for our use case: it is the latency ceiling under which real-time call routing can happen before a human caller registers any interaction with the system. Getting there required decisions at four levels of the stack:

Audio chunking. We process in 20ms non-overlapping chunks at 16kHz mono. Most production speech pipelines use 100–200ms chunks because they're optimized for transcription accuracy. At 20ms, we sacrifice some transcription quality in exchange for signal readiness 5–10× earlier. For intent classification (not verbatim transcription), this tradeoff is favorable.
Feature extraction. Rather than running audio through a full transcription pipeline before classification, we extract acoustic and prosodic features (pitch, energy, speaking rate, pause patterns) in parallel with partial transcription. These features carry significant intent signal even before words are fully recognized.
Model pruning and quantization. Our production intent classifier is a pruned version of a larger model, running at INT8 precision. Full-precision inference took 35ms per chunk — too slow for our target. Quantized inference runs in 8–12ms per chunk, with less than 2% accuracy degradation on our validation set.
Edge inference. The model runs on GPU-backed edge nodes co-located with our carrier interconnects. A round trip to a central inference cluster adds 12–25ms, and running at the edge eliminates it. This was the single largest latency reduction in our architecture: an 18ms improvement from moving inference closer to the source.
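To make the chunking arithmetic concrete: at 16kHz mono, a 20ms chunk is 320 samples. The sketch below splits a sample buffer into non-overlapping chunks and computes RMS energy as one example of a cheap per-chunk feature. The function names are hypothetical, the buffer is assumed to hold float samples, and dropping a trailing partial chunk is an assumption of this sketch rather than a description of the production pipeline.

```python
import math
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 20
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 320 samples per chunk


def chunk_audio(samples: Sequence[float]) -> Iterator[Sequence[float]]:
    """Yield consecutive non-overlapping 20ms chunks; a trailing partial chunk is dropped."""
    last_start = len(samples) - SAMPLES_PER_CHUNK
    for start in range(0, last_start + 1, SAMPLES_PER_CHUNK):
        yield samples[start:start + SAMPLES_PER_CHUNK]


def rms_energy(chunk: Sequence[float]) -> float:
    """Root-mean-square energy of one chunk, a simple stand-in for an energy feature."""
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


# One second of constant-amplitude audio yields 50 chunks of 320 samples each.
one_second = [0.5] * SAMPLE_RATE_HZ
chunks = list(chunk_audio(one_second))
energies = [rms_energy(c) for c in chunks]
```

Pitch, speaking rate, and pause patterns need more context than a single 20ms chunk, so a real extractor would keep a short rolling window across chunks. That part is left out here.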

False positives: the tradeoff you're actually managing

Confidence threshold	False positive rate	TTFS (p50)	Missed signals
0.70	12.4%	38ms	3.2%
0.80	4.1%	58ms	7.8%
0.90	1.2%	94ms	18.3%
0.95	0.4%	142ms	31.5%

We set 0.80 as our production threshold. At 0.70, the 12.4% false positive rate causes too many incorrect routing decisions; an agent receiving a caller who doesn't match the intent signal is worse than a slightly delayed correct signal. At 0.90 and above, median TTFS climbs to 94ms or more, past the useful window for live routing decisions, and missed signals leave too many callers unrouted. The 0.80 setting delivers a false positive rate of about 4% (4.1% in the table above) and a sub-60ms median TTFS. It's the right tradeoff for production-scale call centers.
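The same selection can be written as a small constrained choice over the table. This is a sketch of the reasoning, not a tool we ship: the `OperatingPoint` type and `pick_threshold` function are hypothetical, and the 5% false-positive ceiling in the usage example is an assumed illustration value (the 60ms ceiling is the target discussed in this article).

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    threshold: float
    false_positive_pct: float
    ttfs_p50_ms: int
    missed_signals_pct: float


# Rows copied from the table above.
TABLE = [
    OperatingPoint(0.70, 12.4, 38, 3.2),
    OperatingPoint(0.80, 4.1, 58, 7.8),
    OperatingPoint(0.90, 1.2, 94, 18.3),
    OperatingPoint(0.95, 0.4, 142, 31.5),
]


def pick_threshold(points: Sequence[OperatingPoint],
                   max_false_positive_pct: float,
                   max_ttfs_ms: int) -> Optional[OperatingPoint]:
    """Among points meeting both ceilings, pick the one with the fewest missed signals."""
    feasible = [
        p for p in points
        if p.false_positive_pct <= max_false_positive_pct
        and p.ttfs_p50_ms <= max_ttfs_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.missed_signals_pct)


# Assumed ceilings: at most 5% false positives, at most 60ms median TTFS.
choice = pick_threshold(TABLE, max_false_positive_pct=5.0, max_ttfs_ms=60)
```

Under those two ceilings, 0.80 is the only row that qualifies. Tightening the false-positive ceiling to 1% while keeping the 60ms budget leaves no feasible row, which is the tradeoff the table is meant to expose.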

Jordan Reyes, ML Lead at Teldrip
