DeepL AI Labs
Our journey began with a foundational challenge: creating a speech-to-text engine that meets the DeepL standard for precision. We went beyond existing architectures, developing proprietary models through a focused process of advanced training and high-quality data refinement. This approach has yielded a clear performance advantage.
Our internal benchmarks show that our models achieve a market-leading Word Error Rate (WER), delivering more accurate transcriptions than established competitors. Rather than relying on public benchmarks, we evaluate transcription quality on a carefully curated proprietary test set that reflects the business use cases our customers care about.
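For readers unfamiliar with the metric: WER is the word-level edit distance (insertions, deletions, substitutions) between a system's transcript and a reference, divided by the reference length, so lower is better. A minimal sketch of how it is computed (this helper is illustrative, not part of any DeepL API):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j] (rolling DP row)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

A perfect transcript scores 0.0; one spurious extra word in a two-word reference scores 0.5.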
A great transcription is just the start. Translating an evolving transcript in real time poses challenging research questions. When translating an intermediate transcript – for example, the first part of a longer sentence – it is hard to know how the speaker will continue. Most first-generation tools approach this problem in one of two ways: either they wait until the full sentence is available, which leads to high translation latency, or they constantly update the translation output, which produces an unpleasant “flickering” user experience (read more about that here).
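One well-known way to trade latency against flickering from the simultaneous-translation literature is a local-agreement policy: emit only the token prefix on which the last few incremental hypotheses agree, since that prefix is unlikely to be retracted later. The sketch below illustrates that general idea only – it is not a description of DeepL's actual algorithm, and `stable_prefix` and its parameters are hypothetical:

```python
def stable_prefix(hypotheses: list[list[str]], k: int = 2) -> list[str]:
    """Return the longest token prefix shared by the last k hypotheses.

    Committing only this agreed-upon prefix avoids retracting already
    displayed text ("flickering") while still streaming output before
    the full sentence is available.
    """
    if len(hypotheses) < k:
        return []  # not enough evidence yet to commit anything
    prefix = []
    for tokens in zip(*hypotheses[-k:]):  # walk positions shared by all k
        if all(t == tokens[0] for t in tokens):
            prefix.append(tokens[0])
        else:
            break  # first disagreement ends the stable prefix
    return prefix
```

For example, if two consecutive hypotheses are "the cat" and "the cat sat", only "the cat" would be shown; "sat" waits until the next hypothesis confirms it.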
For DeepL Voice, our ambition is to provide a smooth user experience while maintaining high translation quality and low latency. With our long-standing research experience in neural machine translation, we are uniquely positioned to push the boundaries of real-time translation and deliver a smooth, stable stream of translated text.
See the difference it makes in these side-by-side screen recordings of DeepL Voice for Meetings (on the right side) and Microsoft Teams translations (on the left side).
By engineering this stable text stream, we solved the primary obstacle to the true goal: effortless, high-quality voice-to-voice conversations. Natural-sounding audio output simply isn't possible when it's being generated from an unstable, flickering script.
With that key piece in place, we are thrilled to announce that the DeepL Voice-to-Voice project is now in active development within DeepL AI Labs. The initial results are incredibly promising.
This work is aligned with our mission to build the future of AI workflows for businesses that operate across the globe: delivering high-quality real-time voice-to-voice translation for many languages is now within reach!
Here is a raw teaser for the text-to-speech (TTS) models with support for voice cloning that the research team is working on at the moment. You can judge the quality for yourself.
However, a good TTS model is just the foundation for a great voice-to-voice experience. Building a product that works in real time requires much more: a strategy for chunked inference, seamless chaining of generated audio, and output speed control to achieve minimal latency.
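To make the chaining problem concrete: when audio is synthesized chunk by chunk, naively concatenating the chunks produces audible clicks at the boundaries, so adjacent chunks are typically joined with a short crossfade. The sketch below is a generic illustration of that technique, not DeepL's implementation; `chain_chunks` and `n_fade` are assumptions for this example:

```python
def chain_chunks(chunks: list[list[float]], n_fade: int = 480) -> list[float]:
    """Concatenate TTS audio chunks (lists of float samples) with a
    linear crossfade of n_fade samples at each boundary.

    Blending the end of one chunk into the start of the next avoids
    the discontinuities (clicks/pops) that plain concatenation causes.
    At 24 kHz, 480 samples is a 20 ms overlap.
    """
    out = list(chunks[0])
    for chunk in chunks[1:]:
        for i in range(n_fade):
            w = (i + 1) / n_fade  # ramp 0 -> 1 across the overlap region
            out[-n_fade + i] = out[-n_fade + i] * (1 - w) + chunk[i] * w
        out.extend(chunk[n_fade:])  # rest of the chunk passes through
    return out
```

A real pipeline would additionally need to pick chunk boundaries at natural pauses and adjust playback speed so the translated audio keeps pace with the speaker.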
The quality you just heard is our new baseline. This technology is a core focus for us, and we’ll be sharing more teasers and deep dives as we approach major events later this year. The future of AI-powered communication is nearly here!