Journal · 22 April 2026 · 7 min read

How real-time call translation actually works

A plain-English walk-through of what happens between someone saying a sentence on a call and the other side reading it in their language. From recording to captions, usually in under two seconds.

People ask us how the voice call translation works, almost always with a small note of suspicion. It’s a fair instinct. A lot of products in this space promise more than they can do, and the “magic” framing has done none of us favours. So here is the actual chain of events, in plain language, when you’re on a call with someone who speaks a different language.

The microphone

Your phone records your voice in short slices, around two hundred milliseconds at a time. That’s a small enough window to feel live, and a long enough one to contain a recognisable sound. Each slice is encoded with the same kind of codec that any voice call uses; we don’t do anything special here.
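To put numbers on the slicing step, here is a minimal sketch. The sample rate and bit depth are assumptions chosen for illustration (16 kHz, 16-bit mono PCM); only the 200-millisecond window comes from the description above, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # assumed; the article does not state a sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
SLICE_MS = 200            # "around two hundred milliseconds" per the text


def slice_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     slice_ms: int = SLICE_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of raw PCM bytes that make up one slice."""
    samples_per_slice = sample_rate_hz * slice_ms // 1000
    return samples_per_slice * bytes_per_sample


def split_into_slices(pcm: bytes, size: int) -> list[bytes]:
    """Cut a raw PCM buffer into consecutive slices; the last one may be shorter."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # one second of silence
slices = split_into_slices(one_second, slice_size_bytes())
```

Under these assumptions each slice is 6,400 bytes before encoding, and one second of speech becomes five slices.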

The peer connection

NatChatt calls connect peer-to-peer where the network allows. That means the encoded audio leaves your phone, takes whatever route your two devices can find to each other, and arrives at the other end without travelling through our servers. When the network won’t cooperate, often because one side is on hotel Wi-Fi that blocks outgoing UDP, we fall back to a relay; the contents are still encrypted in transit, and we don’t store them.
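The "direct first, relay as a fallback" rule can be sketched in a few lines. This is an illustration of the preference order only, not the real connection logic; the Route type and pick_route function are hypothetical names, and how reachability is actually tested is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    kind: str        # "direct" (peer-to-peer) or "relay"
    reachable: bool  # outcome of a connectivity check, however it is performed


def pick_route(candidates: list[Route]) -> Optional[Route]:
    """Prefer any reachable direct route; otherwise fall back to a reachable relay."""
    for preferred in ("direct", "relay"):
        for route in candidates:
            if route.kind == preferred and route.reachable:
                return route
    return None  # no usable path at all


# A network that blocks the direct path but lets the relay through.
hotel_wifi = [Route("direct", reachable=False), Route("relay", reachable=True)]
chosen = pick_route(hotel_wifi)
```

On the hotel network in the example, the direct route fails its check, so the relay is chosen.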

So the person on the other end already has your voice. They’re listening to it in real time, the way they would on any call.

The captions

The captions are a parallel stream. While the call is going on, every second or so we take what you’ve just said and send a chunk of it to Google’s Gemini API. The request amounts to: here’s an audio clip; please return the transcript, and please translate it into this target language. Gemini is good at this. It returns a structured response with the transcript in the original language and the translation in the target language.
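As a sketch of what "a structured response" might look like on the receiving side: the instruction wording and the JSON keys below are hypothetical, picked for illustration, and no real Gemini API call is shown. The point is only that the reply carries two fields, the transcript and the translation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    transcript: str    # what was said, in the speaker's language
    translation: str   # the same sentence in the listener's language


def build_instruction(target_language: str) -> str:
    """Hypothetical wording; the real prompt is not given in the article."""
    return (
        "Transcribe the attached audio clip, then translate the transcript "
        f"into {target_language}. Reply as JSON with the keys "
        '"transcript" and "translation".'
    )


def parse_caption(raw_json: str) -> Caption:
    """Turn the model's JSON reply (assumed shape) into a Caption."""
    data = json.loads(raw_json)
    return Caption(
        transcript=data["transcript"].strip(),
        translation=data["translation"].strip(),
    )


reply = '{"transcript": " Hola, abuela ", "translation": "Hello, grandma"}'
caption = parse_caption(reply)
```

Keeping both fields together is what later makes it possible to show the translated caption while still letting the reader look at the original wording.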

The translation comes back to us. We push it down a side channel to the other phone, which renders the caption at the bottom of the screen. Then we drop the clip from our side and move on to the next one.

The whole loop, on a steady connection, takes between one and two seconds. On a flaky connection, it can take three or four. We show a small indicator when it’s lagging so you know to slow down a bit; slowing down is usually enough.
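One simple way to decide when to show a lag indicator is to average the last few round trips and compare against a threshold. The sketch below assumes a two-second threshold (the top of the "steady connection" range above) and a five-sample window; both numbers and the LagMonitor name are illustrative assumptions, not a description of the app's real logic.

```python
from collections import deque

LAG_THRESHOLD_S = 2.0  # assumed threshold, taken from the "one to two seconds" range
WINDOW = 5             # assumed number of recent round trips to average


class LagMonitor:
    def __init__(self, threshold_s: float = LAG_THRESHOLD_S, window: int = WINDOW):
        self.threshold_s = threshold_s
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, round_trip_s: float) -> None:
        """Store the time from sending a clip to receiving its caption."""
        self.samples.append(round_trip_s)

    def is_lagging(self) -> bool:
        """True when the recent average exceeds the threshold."""
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_s


monitor = LagMonitor()
for seconds in (1.2, 1.5, 1.8):
    monitor.record(seconds)
steady = monitor.is_lagging()

for seconds in (3.5, 4.0, 3.8, 3.6, 3.9):
    monitor.record(seconds)
flaky = monitor.is_lagging()
```

Averaging over a short window keeps one slow caption from flashing the indicator on and off.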

Why it isn’t a robot voice

We deliberately don’t synthesise the other person’s words in a fake voice. Two reasons. One, even the best synthetic voices still sound uncanny. People can usually tell, and it pulls them out of the call. Two, more important, the voice is most of the call. The way your grandmother says your name. The way your partner laughs. We didn’t want to remove that to spare you the work of glancing at a caption.

When captions go wrong

They do, especially with strong accents, low-volume speakers, or words that aren’t in the model’s vocabulary. We try to fail visibly: if the model isn’t confident, we show the caption with the lower-confidence words underlined, so you know what to ask about.
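A minimal sketch of the "fail visibly" idea, assuming each word arrives with a confidence score between 0 and 1. The 0.6 cut-off, the input shape, and the use of underscores to stand in for underlining are all assumptions made for the example.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed cut-off; the real value is not stated


def mark_uncertain(words: list[tuple[str, float]],
                   threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Join words into a caption, wrapping low-confidence words in underscores.

    The underscores stand in for the underline a real caption view would draw.
    """
    return " ".join(
        f"_{word}_" if confidence < threshold else word
        for word, confidence in words
    )


caption = mark_uncertain([("see", 0.95), ("you", 0.90), ("Thursday", 0.40)])
```

Here only "Thursday" falls below the cut-off, so it is the one word flagged for the reader to double-check.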

We also let you tap any caption to see the original transcript, the version in the speaker’s language. If you have any of that language, this is often the fastest way to figure out what was meant.

The privacy bit

This is the question we get the most. Are we listening to the call? Is Google listening to the call?

We never store the audio. We don’t keep a copy of the transcript or the translation past the moment we deliver it to the other phone. We don’t have a database of past call captions. The only thing we hold is a bare counter of how many seconds you spent on translated calls in a month, for billing, and that counter is dropped after ninety days.
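The billing counter is small enough to sketch in full. The version below keeps one number per calendar month and deletes it once ninety days have passed since that month ended. Counting the ninety days from the end of the month is an assumption, as are the UsageLedger name and its methods.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=90)  # "dropped after ninety days" per the text


def month_end(year: int, month: int) -> date:
    """First day of the following month, used as the end of the billing month."""
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)


class UsageLedger:
    def __init__(self) -> None:
        # {(year, month): seconds of translated calls}; nothing else is kept
        self.seconds_by_month: dict[tuple[int, int], int] = {}

    def add_call(self, day: date, seconds: int) -> None:
        key = (day.year, day.month)
        self.seconds_by_month[key] = self.seconds_by_month.get(key, 0) + seconds

    def purge(self, today: date) -> None:
        """Delete counters whose month ended at least ninety days ago."""
        expired = [
            key for key in self.seconds_by_month
            if today - month_end(*key) >= RETENTION
        ]
        for key in expired:
            del self.seconds_by_month[key]


ledger = UsageLedger()
ledger.add_call(date(2026, 1, 15), 120)
ledger.add_call(date(2026, 1, 20), 60)
ledger.purge(date(2026, 3, 1))   # January is still within the retention window
kept = dict(ledger.seconds_by_month)
ledger.purge(date(2026, 6, 1))   # well past ninety days after January ended
```

Note what the structure cannot hold: no audio, no transcript, no caption text, only a number of seconds per month.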

Google does process your audio at the moment the request is made. Under the paid Gemini API terms, that audio is not used to train their public models, and it isn’t stored beyond the time needed to answer the request. We’re aware this is a trust statement we’re reading off their side of the contract; if Google were to change those terms, we would change ours or change providers.

What it doesn’t do

It doesn’t magically make you fluent. It doesn’t catch idioms that don’t have a clean equivalent. It doesn’t handle three people talking over each other in different languages on the same call; we caption the loudest speaker. It struggles with shouting, with sneezing, with the kind of mumble that even a fluent listener wouldn’t catch.
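"Loudest speaker" needs some measure of loudness. One common choice is root-mean-square (RMS) energy over a short frame of samples; the sketch below uses it purely as an assumption, since the measure actually used is not stated, and the function names are hypothetical.

```python
import math
from typing import Optional


def rms(samples: list[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_speaker(frames_by_speaker: dict[str, list[int]]) -> Optional[str]:
    """Return the speaker whose current frame has the highest RMS energy."""
    if not frames_by_speaker:
        return None
    return max(frames_by_speaker, key=lambda name: rms(frames_by_speaker[name]))


frames = {
    "quiet": [100, -100, 100, -100],
    "loud": [1000, -900, 1100, -1000],
}
speaker = loudest_speaker(frames)
```

This also shows why overlapping speech is hard: whoever is quieter in a given moment simply does not get captioned.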

It does, almost always, let two people who can’t speak each other’s language have a real conversation. With pauses, with re-asks, with the occasional confused face. The same as any phone call between two people learning to hear each other for the first time.

A quiet point

If you build voice translation into something, you owe the user honesty about what it can and can’t do. We hope this is what that honesty looks like. If you’re curious how the same thing works for text or for video, we’ve written about that elsewhere on this site. If you’d like to ask us something we haven’t covered, write to us through the contact form. We try to reply.

Written by the NatChatt team. If you’d like to write back, the contact form is at the foot of every page.