How VoxTranslate works: real-time multilingual voice translation explained

VoxTranslate lets people who don't share a language hold a normal video conversation. You speak; everyone else reads — and, on the higher tiers, hears — your words in their own language, live. This post walks through what actually happens between the moment you talk and the moment someone on the other side understands you.

The real-time pipeline

Every call runs the same four-stage loop, continuously, for each speaker:

Capture. Your browser streams short, low-latency audio chunks using the Opus codec — no plugins, no native app.
Transcribe. Streaming speech recognition turns that audio into text as you speak, with the source language detected automatically or set by you.
Translate. The transcript is translated in parallel into every language present in the room.
Deliver. Each listener sees live subtitles in their chosen language. On higher tiers they also hear a natural spoken translation.
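
The transcribe-then-translate part of that loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `transcribe`, `translate` and `process_chunk` are hypothetical names, the "recognizer" and "translator" are stubs that do no real speech or language work, and nothing here reflects VoxTranslate's actual code. What it shows is the shape of the fan-out: one transcript per audio chunk, then one translation per distinct language in the room, all started together.

```python
import asyncio

# Hypothetical stand-ins: none of these names come from VoxTranslate.
async def transcribe(chunk: bytes, source_lang: str) -> str:
    """Stub recognizer: the 'audio' here is just UTF-8 text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return chunk.decode("utf-8")

async def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stub translator: tags the text instead of translating it."""
    await asyncio.sleep(0)
    if source_lang == target_lang:
        return text
    return f"[{source_lang}->{target_lang}] {text}"

async def process_chunk(chunk: bytes, source_lang: str, room_langs: list[str]) -> dict[str, str]:
    """Transcribe once, then translate into every room language concurrently."""
    text = await transcribe(chunk, source_lang)
    targets = sorted(set(room_langs))  # one translation per distinct language
    results = await asyncio.gather(
        *(translate(text, source_lang, lang) for lang in targets)
    )
    return dict(zip(targets, results))

subtitles = asyncio.run(process_chunk(b"hello", "en", ["en", "es", "ja", "es"]))
print(subtitles)
```

Deduplicating the room's languages first means two Spanish-speaking listeners cost one translation, not two.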

Why peer-to-peer matters

The video and audio you share travel directly between browsers over WebRTC, in a mesh of up to four people. That media isn't recorded or relayed through a central server; the server handles signaling, translation and chat relay. Fewer hops mean lower latency and a smaller privacy surface.
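
To see why a mesh is capped at a small room size, it helps to count the direct links. The sketch below assumes a full mesh, where every participant connects to every other one; the post does not spell out the exact topology, and `mesh_links` and `MAX_PARTICIPANTS` are hypothetical names used only for illustration.

```python
from itertools import combinations

MAX_PARTICIPANTS = 4  # room size limit stated in the text

def mesh_links(participants: list[str]) -> list[tuple[str, str]]:
    """Return every pair of peers that needs a direct link in a full mesh."""
    if len(participants) > MAX_PARTICIPANTS:
        raise ValueError(f"a mesh room holds at most {MAX_PARTICIPANTS} people")
    return list(combinations(participants, 2))

links = mesh_links(["ana", "ben", "chen", "dara"])
print(len(links))  # 6 links for a four-person room
```

The count grows as n(n-1)/2, and each person uploads their stream to n-1 peers, so a four-person room means three outgoing streams per browser.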

The four engine tiers

Not every conversation needs the same trade-off between speed, voice quality and cost, so VoxTranslate lets you pick an engine per call.

Standard

The default. Fast, economical streaming recognition with live translated subtitles and a built-in browser voice. Perfect for everyday chats.

Enhanced

A client-direct streaming path tuned for ultra-low latency (responsiveness of roughly under 250 milliseconds) across a wide set of languages. Ideal for fast, natural back-and-forth.

Pro

Live AI translation with a natural synthesized voice. The sweet spot for meetings and demos: high quality, a real spoken translation, and a balanced cost.

Premium

The highest-fidelity option, with a natural AI voice and the broadest coverage — all 84 supported languages. Built for high-stakes conversations.
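
The four tiers described above can be summarized as a small lookup table. This is a hypothetical data model, not VoxTranslate's configuration format: the `Tier` class and its field names are invented for illustration, and any detail the descriptions above do not state is left as `None` instead of being guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Tier:
    name: str
    voice: Optional[str]           # "browser", "natural", or None if not stated above
    language_count: Optional[int]  # None where the text gives no number

TIERS = [
    Tier("Standard", voice="browser", language_count=None),
    Tier("Enhanced", voice=None, language_count=None),
    Tier("Pro", voice="natural", language_count=None),
    Tier("Premium", voice="natural", language_count=84),
]

def tiers_with_natural_voice() -> list[str]:
    """Names of the tiers described as having a natural spoken translation."""
    return [t.name for t in TIERS if t.voice == "natural"]

print(tiers_with_natural_voice())  # ['Pro', 'Premium']
```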

A note on AI output

Transcription and translation are produced by AI and can contain mistakes. The spoken translation you hear is computer-generated — not a recording of the speaker. VoxTranslate is built for everyday communication, not for critical legal, medical or safety decisions.

Try it yourself

The fastest way to understand the pipeline is to feel it. Open a room, pick your language, and have a one-minute conversation with someone who speaks another.
