The voicemail counter on the front-desk phone at a physiotherapy practice in Utrecht read 11 unheard messages when the door unlocked at 08:00 on a Tuesday in April. By 08:47 it was 17. Marieke, the receptionist, had not yet finished her coffee. Two of the messages were from the same patient, three minutes apart, because she had not been sure the first one went through. The practice asked whether a voice agent could pick up the slack without scaring off the 78-year-olds.
That practice is where we ran a week-long field test of a Dutch-speaking voice agent in late April 2026. Three therapists, around 220 active patients, paper-thin margins on receptionist time, and a calendar that lived in software (Intramed) without a usable public API. This is what we built, what broke on day one, and the numbers after seven days.
The brief
Inbound calls split predictably across a typical week. Roughly 60% reschedules, 20% new-patient inquiries, 10% insurance questions, 10% "can I speak to my therapist". Volume peaked Monday morning (queue of 9 by 10:00) and late Friday afternoon, when people remembered they had a Saturday plan.
The brief from the practice owner was narrow and clear. Absorb reschedules. Do not book new patients. Never make an elderly caller feel they are arguing with a vending machine. Hand off cleanly when something sounds clinically off.
Stack and what we ruled out
The bones of the system:
- Dutch geographic number on Twilio Media Streams, forwarded from the practice's existing landline after three rings.
- Speech-to-speech reasoning via the OpenAI Realtime API over WebSocket.
- Text-to-speech swapped to ElevenLabs Multilingual v2 with a Dutch voice. The default Realtime voices were intelligible but read as American-coded to older callers in early test calls.
- Calendar via a shared Google Calendar that Marieke already mirrored from Intramed. Direct integration with Intramed would have eaten the entire week.
- SMS confirmation via MessageBird, so the caller had a written record of the booking before hanging up.
- Post-call reconciler running every 60 seconds, checking the calendar for double-books and flagging them to a Slack channel.
We considered a fully local stack (Whisper plus a 70B-class model plus an open-weights Dutch TTS). Acceptable latency was achievable on a single H100, but the cost per call did not beat the hosted path at this volume, and we did not want to be the first ones debugging an open-weights Dutch voice clone on a Tuesday afternoon when the practice owner called.
Day one and the things that broke
The first eight hours produced three problems that, in hindsight, were all foreseeable.
Voice activity detection was too eager. The default VAD treats a 200 ms pause as turn-end. That is fine for someone dictating a Slack message. It is wrong for Mrs. de Vries, 78, who is looking up her appointment card in her handbag. We pushed server-side VAD to 700 ms of trailing silence and added a soft prompt to the system instructions: if the caller is speaking slowly, do not interrupt, wait.
Dutch numbers parsed as two values. Spoken Dutch dates are a minefield because the units come before the tens: "vierentwintig" is literally "four-and-twenty". "Vierentwintig mei" (the 24th of May) was being transcribed correctly, but the model emitted JSON like {"day": 4, "extra": 20} before recovering. We added a Dutch-specific tool schema and a few-shot block with five spoken date and time examples covering the worst offenders (vierentwintigste, eenendertigste, half drie, kwart voor tien). The error rate on date extraction dropped from one in five to one in fifty across the rest of the week.
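One cheap guard against split values like the one above is to cross-check the model's extracted day against a deterministic parse of the transcribed word. The following is an illustrative sketch, not the deployed code: the function name `parse_day_of_month` and the word tables are our own for this example, and it only covers day-of-month values 1 to 31.

```python
import unicodedata
from typing import Optional

UNITS = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
         "zes": 6, "zeven": 7, "acht": 8, "negen": 9}
TEENS = {"tien": 10, "elf": 11, "twaalf": 12, "dertien": 13, "veertien": 14,
         "vijftien": 15, "zestien": 16, "zeventien": 17, "achttien": 18,
         "negentien": 19}
TENS = {"twintig": 20, "dertig": 30}
# Ordinals that are not simply "cardinal + de/ste".
SPECIAL_ORDINALS = {"eerste": 1, "derde": 3}


def _strip_accents(text: str) -> str:
    # "tweeëntwintig" -> "tweeentwintig", "één" -> "een"
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def _cardinal(word: str) -> Optional[int]:
    for table in (UNITS, TEENS, TENS):
        if word in table:
            return table[word]
    # Compound form: unit + "en" + tens, e.g. vier-en-twintig = 24.
    for tens_word, tens in TENS.items():
        tail = "en" + tens_word
        if word.endswith(tail) and word[: -len(tail)] in UNITS:
            return tens + UNITS[word[: -len(tail)]]
    return None


def parse_day_of_month(word: str) -> Optional[int]:
    """Return 1..31 for a spoken Dutch cardinal or ordinal, else None."""
    w = _strip_accents(word.strip().lower())
    value = _cardinal(w)
    if value is None:
        value = SPECIAL_ORDINALS.get(w)
    if value is None:
        # Lenient ordinal handling: strip "ste" or "de" and retry.
        for suffix in ("ste", "de"):
            if w.endswith(suffix):
                value = _cardinal(w[: -len(suffix)])
                if value is not None:
                    break
    return value if value is not None and 1 <= value <= 31 else None
```

If the tool call's day disagrees with the parsed word, the safer move is to ask the caller to repeat the date rather than trust either value.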
The calendar had a race. Two callers wanted the same Thursday 14:30 slot. Both ended up confirmed. One of them was the practice owner's mother-in-law, which is the kind of bug you remember. We added a per-slot advisory lock in Redis around the read-check-write window of the booking tool, plus the reconciler that runs every minute and surfaces conflicts before the affected patient walks in.
If your booking tool reads availability and writes a confirmation as two separate calls, you have a race condition. It will fire on the day a relative calls in. Build the lock first.
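The shape of that lock, as a minimal sketch under stated assumptions: the store only needs a `set(name, value, nx=True, ex=...)` call, which redis-py's client provides, plus `get` and `delete`. `InMemoryStore`, `slot_lock`, `book_slot`, and the dict standing in for the calendar are hypothetical names for this example, not the production code.

```python
import time
import uuid
from contextlib import contextmanager


class InMemoryStore:
    """Test stand-in for the small subset of a Redis client used below."""

    def __init__(self):
        self._data = {}

    def _live(self, name):
        entry = self._data.get(name)
        if entry and entry[1] is not None and entry[1] <= time.monotonic():
            del self._data[name]  # key has expired
            return None
        return entry

    def set(self, name, value, nx=False, ex=None):
        if nx and self._live(name) is not None:
            return None  # key exists, so SET NX fails
        expires = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, expires)
        return True

    def get(self, name):
        entry = self._live(name)
        return None if entry is None else entry[0]

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


class SlotBusy(Exception):
    """Another call currently holds the lock for this slot."""


@contextmanager
def slot_lock(store, slot_id, ttl_seconds=10):
    key = f"lock:slot:{slot_id}"
    token = uuid.uuid4().hex
    # SET NX with an expiry: only one caller wins, and a crashed
    # worker cannot hold the slot forever.
    if not store.set(key, token, nx=True, ex=ttl_seconds):
        raise SlotBusy(slot_id)
    try:
        yield
    finally:
        # Release only our own lock. Against real Redis this get+delete
        # pair should be one Lua script so the check is atomic, and the
        # client needs decode_responses=True for the comparison to hold.
        if store.get(key) == token:
            store.delete(key)


def book_slot(store, calendar, slot_id, patient):
    """Read-check-write inside the lock. `calendar` is a dict stand-in."""
    with slot_lock(store, slot_id):
        if slot_id in calendar:
            return False
        calendar[slot_id] = patient
        return True
```

A caller that hits `SlotBusy` can retry after a short pause or offer the next free slot. Either way, the second confirmation is never written.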
Policy choices that mattered more than the model
The model was not the interesting part. The policy was.
The voice agent was forbidden from booking a new patient under any circumstances. New patients trigger insurance verification, intake forms, and a first-visit slot that is longer than standard. We routed every "I have never been here before" turn to a queued callback within four working hours.
The voice agent escalated immediately on a fixed list of phrases: "pain after a fall", "couldn't move", "numbness", "since the accident", "since the operation", and anything where the caller asked for a same-day slot. Same-day requests went to Marieke because they almost always involve a judgement about whether a therapist should pick up the phone directly.
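A fixed list like this is easiest to enforce outside the model, on the transcript of each caller turn. A rough sketch follows, with the phrases kept in their English glosses from above (a real deployment would match the Dutch originals). The same-day hints and the function name `escalation_reason` are assumptions for illustration.

```python
from typing import Optional

ESCALATION_PHRASES = (
    "pain after a fall",
    "couldn't move",
    "numbness",
    "since the accident",
    "since the operation",
)
# Assumption: crude markers for a same-day request.
SAME_DAY_HINTS = ("today", "same day", "same-day")


def escalation_reason(transcript: str) -> Optional[str]:
    """Return the phrase that triggered escalation, or None to continue."""
    # Normalise case, curly apostrophes, and runs of whitespace.
    text = " ".join(transcript.lower().replace("\u2019", "'").split())
    for phrase in ESCALATION_PHRASES + SAME_DAY_HINTS:
        if phrase in text:
            return phrase
    return None
```

Returning the matched phrase rather than a boolean gives the handoff message something concrete to show the receptionist.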
Every confirmation read back date, day-of-week, time, and therapist name. ("Donderdag 21 mei, om half drie, bij Bram." In English: "Thursday 21 May, at half past two, with Bram.") This single line cut the reconciler's flagged-conflict rate by more than half, because callers caught wrong dates before the booking was written.
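The read-back is most useful when the weekday is derived from the date instead of being taken from the model's output, so the two can never disagree. A small sketch of that idea, assuming the booking start is available as a `datetime`. The helper names `spoken_time` and `read_back` are ours, and only whole, quarter, and half hours get a spoken form.

```python
from datetime import datetime

WEEKDAYS = ("maandag", "dinsdag", "woensdag", "donderdag",
            "vrijdag", "zaterdag", "zondag")
MONTHS = ("januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december")
# Index is hour % 12, so 0 and 12 both read as "twaalf".
HOURS = ("twaalf", "een", "twee", "drie", "vier", "vijf",
         "zes", "zeven", "acht", "negen", "tien", "elf")


def spoken_time(hour: int, minute: int) -> str:
    this_hour = HOURS[hour % 12]
    next_hour = HOURS[(hour + 1) % 12]
    if minute == 0:
        return f"{this_hour} uur"
    if minute == 15:
        return f"kwart over {this_hour}"
    if minute == 30:
        return f"half {next_hour}"  # "half drie" means 2:30, not 3:30
    if minute == 45:
        return f"kwart voor {next_hour}"
    return f"{hour:02d}:{minute:02d}"  # fallback: plain digits


def read_back(start: datetime, therapist: str) -> str:
    # datetime.weekday() is 0 for Monday, matching WEEKDAYS.
    weekday = WEEKDAYS[start.weekday()].capitalize()
    month = MONTHS[start.month - 1]
    return (f"{weekday} {start.day} {month}, "
            f"om {spoken_time(start.hour, start.minute)}, bij {therapist}.")
```

For a booking at 14:30 on 21 May 2026 with Bram, `read_back` yields "Donderdag 21 mei, om half drie, bij Bram."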
We logged STT confidence per turn. When confidence dropped below 0.7 for two consecutive turns, the agent slowed its TTS rate by 15% and offered: "Hoort u mij goed?" ("Can you hear me well?") Low-confidence streaks correlated with older callers and noisy lines, and the heuristic dropped mid-call hangups in that segment from eight on Monday to two by Friday.
The clinic also operates under Dutch medical data rules (AVG, the Dutch name for the GDPR, plus the NEN 7510 healthcare information-security standard), so the policy layer doubled as the place where we logged what the agent did and did not retain in transcripts. The Autoriteit Persoonsgegevens takes a dim view of a chatty AI sitting on a phone line with no retention policy.
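The low-confidence heuristic is a small piece of state per call. A minimal sketch, assuming a per-turn confidence score between 0 and 1 is available from the transcription step (how you obtain it depends on your STT provider). The class name `ClarityMonitor` is hypothetical.

```python
class ClarityMonitor:
    """Tracks consecutive low-confidence caller turns within one call."""

    def __init__(self, threshold=0.7, streak_needed=2, slowdown=0.15):
        self.threshold = threshold
        self.streak_needed = streak_needed
        self.slowdown = slowdown
        self.rate = 1.0  # TTS speaking-rate multiplier
        self._low_streak = 0

    def observe(self, confidence: float) -> bool:
        """Record one turn. Returns True when the agent should slow
        down and ask "Hoort u mij goed?"."""
        if confidence < self.threshold:
            self._low_streak += 1
        else:
            self._low_streak = 0
        # Fire once, exactly when the streak reaches the limit, so the
        # agent does not repeat the question on every later bad turn.
        if self._low_streak == self.streak_needed:
            self.rate = round(1.0 - self.slowdown, 2)
            return True
        return False
```

The monitor keeps the slower rate for the rest of the call. Whether to speed back up after a run of clear turns is a policy choice this sketch leaves open.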
The numbers after seven days
Seven calendar days. 312 inbound calls. The practice was open Monday to Saturday morning.
- 198 calls (63%) handled end-to-end without human involvement.
- 84 calls (27%) escalated cleanly to Marieke, either by policy (new patient, clinical phrase, same-day) or because the caller asked for a person.
- 30 calls (10%) dropped during or before the voice agent's opening line.
- Mean call duration: 1:42. Median: 1:28.
- Reschedule completion rate on attempted reschedules: 89%.
- New-patient bookings made by the agent: 0 (by design).
- Double-bookings caught by the reconciler: 2. Both resolved before the patient arrived.
- Complaints registered with the practice: 1, from a 71-year-old who wanted a human. We honoured it and added her number to a list that bypasses the agent.
The receptionist's outbound callback queue (voicemails she had to return) dropped from a Monday-morning average of 14 to 2 by the end of week one. That is the number that matters. It is not a benchmark, it is a person's afternoon back.
The interesting metric was not "calls handled by the agent". It was "voicemails the receptionist no longer had to return on Monday morning". Optimise for the human you are protecting.
A small piece of the actual config
For anyone wiring up the Realtime API for a Dutch-speaking practice, two things in the session config did most of the work. Server-side VAD with a longer silence threshold, and a short, opinionated system prompt that names the practice and refuses anything outside scope. Sketch:
{
  "type": "session.update",
  "session": {
    "modalities": ["text"],
    "instructions": "Je bent de telefonische assistent van Praktijk X in Utrecht. Spreek Nederlands, rustig en beknopt. Je plant alleen vervolgafspraken voor bestaande patiënten. Nieuwe patiënten en spoed verbind je door. Lees iedere bevestiging hardop terug: datum, dag, tijd, therapeut.",
    "input_audio_format": "g711_ulaw",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 700
    },
    "tools": [
      { "type": "function", "name": "find_slots" },
      { "type": "function", "name": "book_slot" },
      { "type": "function", "name": "escalate_to_human" }
    ]
  }
}
The TTS audio comes from ElevenLabs and is streamed back to Twilio as g711 mu-law. The Realtime API handles transcription and reasoning; the voice is rendered externally. That split costs you about 250 ms of perceived latency on first-token, and it is worth every millisecond when an older caller hears a voice they parse as Dutch first and synthetic second.
What we would do differently next time
Three things, in order of how much pain they would have saved.
First, build the reconciler before going live, not after the third double-booking. A 30-line script that diffs the calendar every minute and posts conflicts to Slack would have flagged the mother-in-law case in under 90 seconds. We had it running by Wednesday. It should have been running on Sunday.
Second, native Dutch voice first, default voice never. The call-completion rate on older callers jumped meaningfully when we swapped to a Dutch ElevenLabs voice. The stock voices on most realtime APIs sound American, and older Dutch callers notice within two seconds.
Third, ship a "press 9 for a person" affordance from the first call. We did not, because we wanted clean data on agent performance. We should have. The one complaint we received would not have existed, and we would have lost nothing measurable.
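For the first point, the core of such a reconciler is just an overlap check per therapist. A sketch under stated assumptions: `Booking` and `find_conflicts` are hypothetical names, and fetching events from the shared calendar and posting to Slack are left out because they depend on credentials and client libraries.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Booking:
    event_id: str
    therapist: str
    start: datetime
    end: datetime


def find_conflicts(bookings):
    """Return pairs of bookings for the same therapist that overlap."""
    by_therapist = defaultdict(list)
    for booking in bookings:
        by_therapist[booking.therapist].append(booking)

    conflicts = []
    for items in by_therapist.values():
        items.sort(key=lambda b: b.start)
        for i, first in enumerate(items):
            for second in items[i + 1:]:
                if second.start >= first.end:
                    # Sorted by start, so nothing later overlaps `first`.
                    break
                conflicts.append((first, second))
    return conflicts
```

Run once a minute over the next few days of events, a non-empty result is what gets posted to the channel. Back-to-back appointments, where one ends exactly when the next starts, are not flagged.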
The practice has kept the system running. The contract for week two onward includes a slow expansion of scope: insurance question deflection (read-only), and an outbound reminder call the evening before each appointment. Same architecture, different prompt.
When we built this voice agent for the Utrecht practice, the thing we kept running into was that the model was almost never the bottleneck. Policy was. We ended up writing more lines of escalation rules than prompt instructions, and that is the right ratio for any voice agent that touches a medical front desk. If you want to try one yourself, start by listing the five phrases that should never reach the model at all.




