How VoxTranslate works: real-time multilingual voice translation explained
A look under the hood at the pipeline that turns your voice into live, translated speech for everyone in the call — and the four engine tiers that power it.
VoxTranslate lets people who don't share a language hold a normal video conversation. You speak; everyone else reads your words in their own language, live, and on the higher tiers also hears them in a natural synthesized voice. This post walks through what actually happens between the moment you talk and the moment someone on the other side understands you.
The real-time pipeline
Every call runs the same four-stage loop, continuously, for each speaker:
- Capture. Your browser streams short, low-latency audio chunks using the Opus codec — no plugins, no native app.
- Transcribe. Streaming speech recognition turns that audio into text as you speak, with the source language detected automatically or set by you.
- Translate. The transcript is translated in parallel into every language present in the room.
- Deliver. Each listener sees live subtitles in their chosen language. On higher tiers they also hear a natural spoken translation.
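The Translate stage is the one that fans out, so it is worth a small illustration. The sketch below is not VoxTranslate's actual code: `translate`, `translateForRoom` and the `Subtitle` type are hypothetical names, and the translation backend is replaced by a stub. It also assumes that listeners who share the speaker's language do not need a translation. What it shows is the general idea of translating one transcript segment into every other language in the room at the same time.

```typescript
// Hypothetical types and stubs for illustration; none of these names
// come from VoxTranslate itself.
type Lang = string;

interface Subtitle {
  lang: Lang;
  text: string;
}

// Stand-in for a real translation backend (assumption).
async function translate(text: string, from: Lang, to: Lang): Promise<string> {
  return `[${from}->${to}] ${text}`;
}

// Translate one transcript segment into every other language in the room.
// Promise.all starts all translations together, so the slowest target
// language sets the latency rather than the sum of all of them.
async function translateForRoom(
  transcript: string,
  source: Lang,
  roomLangs: Lang[],
): Promise<Subtitle[]> {
  // Drop duplicates and the speaker's own language (assumed not to need translation).
  const targets = roomLangs.filter(
    (lang, i) => lang !== source && roomLangs.indexOf(lang) === i,
  );
  return Promise.all(
    targets.map(async (lang) => ({
      lang,
      text: await translate(transcript, source, lang),
    })),
  );
}
```

Calling `translateForRoom("hello", "en", ["en", "es", "fr", "es"])` with this stub resolves to two subtitles, one for `es` and one for `fr`, in that order.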
Why peer-to-peer matters
The video and audio you share travel directly between browsers over WebRTC, in a mesh of up to four people. Your call media isn't recorded or relayed through a central server; the server handles signaling, translation and chat relay. Fewer hops means lower latency and a smaller privacy surface.
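A quick way to see what a mesh costs is to count its connections. The post does not say why rooms stop at four people, so treat the following as general full-mesh arithmetic rather than VoxTranslate's stated reasoning; `meshLinks` and `uploadsPerPeer` are hypothetical helper names.

```typescript
// Full-mesh arithmetic: every participant connects directly to every other one.
// Helper names are hypothetical, for illustration only.

function assertParticipantCount(n: number): void {
  if (Math.floor(n) !== n || n < 1) {
    throw new RangeError("participants must be a positive integer");
  }
}

// Total number of direct peer-to-peer links in the room: n * (n - 1) / 2.
function meshLinks(participants: number): number {
  assertParticipantCount(participants);
  return (participants * (participants - 1)) / 2;
}

// Number of other peers each participant sends its own media to.
function uploadsPerPeer(participants: number): number {
  assertParticipantCount(participants);
  return participants - 1;
}
```

For a full room of four, `meshLinks(4)` is 6 and `uploadsPerPeer(4)` is 3. Both numbers grow with every person added, which is the usual reason mesh calls stay small.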
The four engine tiers
Not every conversation needs the same trade-off between speed, voice quality and cost, so VoxTranslate lets you pick an engine per call.
Standard
The default. Fast, economical streaming recognition with live translated subtitles and a built-in browser voice. Perfect for everyday chats.
Enhanced
A client-direct streaming path tuned for ultra-low latency, with responsiveness of roughly 250 milliseconds or better, across a wide set of languages. Ideal for fast, natural back-and-forth.
Pro
Live AI translation with a natural synthesized voice. The sweet spot for meetings and demos: high quality, a real spoken translation, and a balanced cost.
Premium
The highest-fidelity option, with a natural AI voice and the broadest coverage — all 84 supported languages. Built for high-stakes conversations.
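To make the trade-offs concrete, here is one way the four tiers could be written down as data, with a small helper that picks one per call. This is a sketch under assumptions: the field names, the `pickTier` helper and its priority order are invented for illustration. The voice for Enhanced is marked `"unspecified"` because the description above does not say which voice it uses.

```typescript
// Hypothetical model of the four tiers; field names and the selection
// logic are illustrative assumptions, not VoxTranslate's configuration.
type Tier = "standard" | "enhanced" | "pro" | "premium";
type Voice = "browser" | "ai" | "unspecified";

interface TierInfo {
  voice: Voice;
  note: string;
}

const TIERS: Record<Tier, TierInfo> = {
  standard: { voice: "browser", note: "default; fast, economical subtitles" },
  enhanced: { voice: "unspecified", note: "client-direct path, lowest latency" },
  pro: { voice: "ai", note: "natural voice at a balanced cost" },
  premium: { voice: "ai", note: "highest fidelity; all 84 supported languages" },
};

interface CallNeeds {
  fullCoverage?: boolean; // needs the broadest language coverage
  aiVoice?: boolean; // needs a natural synthesized voice
  lowestLatency?: boolean; // needs the fastest back-and-forth
}

// Pick a tier from coarse needs, checking the most demanding need first
// (this priority order is an assumption).
function pickTier(needs: CallNeeds): Tier {
  if (needs.fullCoverage) return "premium";
  if (needs.aiVoice) return "pro";
  if (needs.lowestLatency) return "enhanced";
  return "standard";
}
```

With no special needs, `pickTier({})` returns `"standard"`, matching the default described above.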
A note on AI output
Transcription and translation are produced by AI and can contain mistakes. The spoken translation you hear is computer-generated — not a recording of the speaker. VoxTranslate is built for everyday communication, not for critical legal, medical or safety decisions.
Try it yourself
The fastest way to understand the pipeline is to feel it. Open a room, pick your language, and have a one-minute conversation with someone who speaks another.