DeepL Expands from Text to Real-Time Voice Translation

DeepL, an AI company known for high-accuracy text translation, is expanding into spoken language with a voice-to-voice translation suite aimed at workplaces, customer support operations, and frontline teams.

The product is designed for real-time conversations in several settings, including online meetings, in-person discussions using mobile devices, and group sessions such as workshops or training sessions. DeepL is also releasing an API so developers and enterprises can integrate its translation capabilities into existing products, including call center platforms and custom communication tools.

According to chief executive Jarek Kutylowski, the move into speech is a direct extension of DeepL’s work on written language. The primary technical constraint is the trade-off between speed and accuracy: real-time translation must keep latency low enough for conversation to flow while still preserving nuance, tone, and domain-specific terminology.

To fit into existing workflows, DeepL is introducing add-ons for major conferencing platforms such as Zoom and Microsoft Teams. While others speak in their native languages, meeting participants will be able to either listen to live translated audio or follow along via translated on-screen captions. The program is currently in early access; organizations can join a waitlist while DeepL tunes performance and user experience.

Beyond formal meetings, the system targets mobile and web-based conversations, enabling cross-language communication for both co-located and remote participants. In group environments such as workshops or training sessions, participants can join a shared multilingual conversation via QR code and receive translation in their preferred language.

DeepL reports that its system can be customized with domain-specific vocabularies, including industry jargon, product names, and organization-specific terms. This customization is positioned as particularly relevant for customer service, where misinterpretation of technical language can disrupt interactions. The company’s stated expectation is that a reliable translation layer will allow organizations to support customers in markets where hiring fluent staff is difficult or expensive.
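
One common way to enforce a custom vocabulary around a translation step is to mask glossary terms before translating and substitute the approved target terms afterwards. The sketch below illustrates that general idea only; it is not DeepL's implementation, and the function names, the placeholder format, and the toy word-for-word translator are all assumptions made for the example.

```python
import re
from typing import Callable, Dict


def translate_with_glossary(
    text: str,
    glossary: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Translate `text`, forcing glossary terms to their approved renderings.

    Hypothetical helper: masks each source term with a placeholder token,
    runs the supplied translator, then swaps in the target term.
    """
    placeholders: Dict[str, str] = {}
    # Longest terms first so multi-word terms win over their substrings.
    ordered = sorted(glossary.items(), key=lambda kv: -len(kv[0]))
    for i, (source_term, target_term) in enumerate(ordered):
        token = f"__TERM{i}__"
        pattern = re.compile(r"\b" + re.escape(source_term) + r"\b", re.IGNORECASE)
        text, count = pattern.subn(token, text)
        if count:
            placeholders[token] = target_term

    translated = translate(text)

    for token, target_term in placeholders.items():
        translated = translated.replace(token, target_term)
    return translated


# Toy stand-in for a real translation engine (assumption for illustration):
# a word-for-word lookup that leaves unknown tokens untouched.
TOY_DICTIONARY = {"restart": "reinicie", "the": "el"}


def toy_translate(text: str) -> str:
    return " ".join(TOY_DICTIONARY.get(word.lower(), word) for word in text.split())


GLOSSARY = {"edge router": "router perimetral"}  # organization-specific term

result = translate_with_glossary("restart the edge router", GLOSSARY, toy_translate)
print(result)  # reinicie el router perimetral
```

In a real deployment the translator would need to pass placeholder tokens through unchanged, which is itself an assumption that does not hold for every engine.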

At present, DeepL’s speech pipeline operates in three stages: transcribing audio to text, translating the text, and generating speech in the target language. The company asserts that its established strength in text translation provides a quality advantage. Its long-term objective is to transition to an end-to-end voice model that translates directly between spoken languages without an intermediate text representation.
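
The three-stage (cascaded) design described above can be sketched as a simple composition of functions. This is a minimal illustration under stated assumptions, not DeepL's code or API: the class name, the stage signatures, and the stand-in stages (which treat "audio" as UTF-8 bytes) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedSpeechTranslator:
    """Hypothetical cascaded pipeline: speech -> text -> translated text -> speech."""

    transcribe: Callable[[bytes], str]        # stage 1: audio to source-language text
    translate: Callable[[str, str], str]      # stage 2: text to target-language text
    synthesize: Callable[[str, str], bytes]   # stage 3: target-language text to audio

    def run(self, audio: bytes, target_lang: str) -> bytes:
        source_text = self.transcribe(audio)
        target_text = self.translate(source_text, target_lang)
        return self.synthesize(target_text, target_lang)


# Stand-in stages so the sketch runs without any speech or translation models.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


TOY_PHRASES = {("hello", "de"): "hallo"}


def fake_translate(text: str, target_lang: str) -> str:
    # Unknown phrases pass through unchanged.
    return TOY_PHRASES.get((text.lower(), target_lang), text)


def fake_synthesize(text: str, target_lang: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedSpeechTranslator(fake_transcribe, fake_translate, fake_synthesize)
output_audio = pipeline.run(b"hello", "de")
print(output_audio)  # b'hallo'
```

Because each stage waits for the previous one, their delays add up, and the intermediate text carries no information about the speaker's tone. An end-to-end model would replace the three callables with a single audio-to-audio step.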

The competitive landscape includes startups such as Sanas, Camb.AI, and Palabra, which are working on related problems in real-time speech, accent adaptation, and voice-preserving translation. This activity suggests that spoken-language capabilities are becoming a central focus for next-generation AI tools.