
DeepL AI Labs

Real-time speech translation isn’t just translation with a new form of input or output. It’s a fundamentally new, different and exciting challenge for AI research. It aims to deliver a very different form of user experience that shifts translation priorities, introduces new constraints, and requires new forms of judgment and decision-making from an AI model.

That’s the challenge that Research Manager Sascha Brinker and Research Scientist Kristina Geißler are taking on as part of our voice research team. They’re part of the group evolving DeepL’s superior quality AI model for text translation to set a new standard in translating real-time voice. They’re now building on that early success with new models and training techniques that open up entirely new possibilities for multilingual, real-time speech.

Building on high-quality text translation models

We started from a good place: the quality and contextual understanding of DeepL’s existing text translation model. The voice team were able to make important early gains by deploying this model and adjusting the inference strategy to increase the speed of translation. They then developed bespoke models for voice that can identify the best moment to output translations, leveraging DeepL’s understanding of the relationships between language pairs and applying new layers of training.

The goal here is to find the right balance between the speed of translation (low latency is crucial to the ability of users to follow and engage with a conversation as it happens) and its accuracy and stability. Mastering this balance means that DeepL doesn’t have to wait for the end of a sentence before translating it. At the same time, it minimizes the ‘flickering’ that occurs when a model has to correct translated subtitles it has already displayed. These things make a huge difference to the user experience.
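To make the latency and stability trade-off concrete, here is a minimal sketch of one simple output policy, sometimes called local agreement: a translated word is only shown once two consecutive partial translations agree on it. This is an illustrative assumption, not a description of DeepL’s models, and the function names and the example sentences are hypothetical.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_prefix_stream(hypotheses: List[List[str]]) -> List[List[str]]:
    """Return what would be displayed after each partial translation.

    A token is committed once two consecutive hypotheses agree on it.
    Committed tokens are only ever appended to, never retracted.
    (Assumption: a real system would also force later hypotheses to
    extend the committed prefix; this sketch simply appends.)
    """
    committed: List[str] = []
    previous: List[str] = []
    displays: List[List[str]] = []
    for hyp in hypotheses:
        agreed = common_prefix_len(previous, hyp)
        if agreed > len(committed):
            committed = committed + hyp[len(committed):agreed]
        displays.append(list(committed))
        previous = hyp
    return displays


def count_flicker(displays: List[List[str]]) -> int:
    """Count displayed tokens that are changed or removed by the next update."""
    flicker = 0
    for before, after in zip(displays, displays[1:]):
        flicker += len(before) - common_prefix_len(before, after)
    return flicker


# Hypothetical partial translations arriving as the speaker talks.
partials = [
    ["I", "want"],
    ["I", "want", "two"],
    ["I", "want", "to", "go"],
    ["I", "want", "to", "go", "home"],
]

naive_flicker = count_flicker(partials)  # show every hypothesis immediately
stable_displays = stable_prefix_stream(partials)
stable_flicker = count_flicker(stable_displays)
```

Showing every hypothesis immediately gives the lowest latency but lets the word "two" appear and then change. Waiting for agreement removes that correction at the cost of showing each word one update later, which is the balance described above.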

Taking out the transcription stage

Adapting and evolving our text translation model has brought us a long way. So much so that Slator currently ranks DeepL as the clear leader on both the quality and stability of real-time voice translations. However, removing the requirement to transcribe speech into text before translating it can take us even further, faster. The team are currently building models that can generate translated speech output directly from audio input, without passing through an intermediate text stage.

We can make further gains by providing our model with more context about the conversations it translates: what’s being discussed, who’s discussing it, and the specific phrases and terminology they’re likely to use. This replicates a lot of the intensive training work that top-level human interpreters do before major events or meetings. Just like them, it enables our models to translate what someone is about to say, from the moment they first start forming a word.

Opening up new frontiers for voice translation

These new, direct, speech-to-speech models sweep away some of the most important constraints that voice translation currently deals with. In doing so, they open up some very exciting new possibilities.

Without the requirement for translating to text and back, we can gain whole seconds in the time taken to deliver a spoken translation. In the context of following speech in real time, that’s a very significant acceleration that will have a big impact on user and audience experiences.

And there’s more. Working directly with audio input means we can train models to detect accents, dialects and nuances embedded in the way that people speak. Extra inference time and richer audio inputs mean that we can create spoken outputs that capture the emotion and the deeper meaning of what people say.

The future of real-time voice translation through AI isn’t just faster. It’s also more profoundly human: capturing more of the many levels on which people communicate when speaking. It’s transforming DeepL from a translation engine into a real-time voice layer that supports the most natural form of human communication and makes language disappear as a source of friction.

That’s what makes this one of the most exciting areas of AI research at DeepL.
