How DeepL Voice tackles the unique challenges of real-time speech translation

Key Takeaways

  • Real-time speech translation demands speed, accuracy, and the ability to handle incomplete sentences: challenges text translation doesn’t face.
  • Spoken language includes disfluencies, colloquialisms, and short affirmations that must be filtered rather than translated literally.
  • Getting close to the speed of speech switches meeting participants from passive listeners to active contributors.
  • DeepL Voice minimizes “flickering” by generating flexible translations that can absorb new information mid-sentence.
  • DeepL partnered with linguistics experts and businesses to ensure Voice handles verb placement, sentence structure, and contextual judgment across languages.
  • NEC Corporation became the first company to fully deploy DeepL Voice after its launch.
  • DeepL’s Language AI suite, including Voice, Translator, and Write, gives global enterprises the tools to communicate clearly.

People don’t speak the same way that they write. And they don’t experience spoken conversations in the same way that they experience reading an email or an article. 

Our ability to understand one another in the moment of speaking—bringing together all kinds of verbal and nonverbal communication to instantly grasp what someone means—is the high-wire act of human expression. 

When you take a step back and consider what happens in a conversation, it’s amazing just how much information we convey so quickly. 

When you set yourself the task of translating spoken interactions as they happen, as DeepL has done with our DeepL Voice solution, you uncover all kinds of fascinating insights into what makes translating spoken language different from translating text.

In this post, I’ll share some of those insights and explain how we’re using them to transform the experience of meetings and conversations.

Real-time, voice-to-voice translation is here: learn how DeepL Voice makes it possible.

Why real-time speech translation is harder than translating text

Instant, conversational communication is fundamentally human and extremely difficult for technology to replicate, even for technology as advanced as AI.

If you want to create solutions for businesses that can help people follow and participate in conversations in multiple languages, you have to start with a deep understanding of the challenges involved.

Those challenges include replicating the human skill of anticipating what people are saying before they finish saying it. 

When you’re translating speech in live situations, you also need to anticipate how someone’s words can best be expressed in another language. Crucially, to avoid lengthy time lags, you need to do this before you know for certain how the original sentence will end.

The challenge here is that what seems to be an accurate translation of a few words could turn out to be an inaccurate translation once the person completes their sentence. 

When we set about developing DeepL Voice, we knew we couldn’t achieve high-quality live speech translation through technology alone. It depends on a deep interest in and understanding of the different ways language works.

That’s why we brought together experts in the linguistics of spoken conversation. We also leveraged DeepL’s powerful contextual understanding of how different languages work.

Additionally, we partnered with businesses to explore their priorities and the speech translation experience that creates the most value for them.

Discover how DeepL AI is transforming translation to enable truly borderless business.

Why speed is the top priority in real-time speech translation

One of the first things we learned is that timing is everything when it comes to real-time translation of a meeting or a conversation.

If you can get close to the speed of speech, displaying a sentence’s translation by the time a speaker has finished it, then you can make those meetings far more inclusive.

Christine Aubry, international coordinator for global patisserie manufacturer Brioche Pasquier, explained further at DeepL Dialogues.

She noted that faster translations switch people’s mode from passive to active participation. Rather than struggling to keep up with what others are saying in another language, they feel fully up to speed. Like a native language speaker, they have the opportunity to interject, shape the conversation, and actively participate.

A second or so makes a huge difference.

Therefore, speed is a top priority when translating real-time speech. But you have to balance speed against other priorities that also have a big impact on people’s experience. Translations must be as accurate as possible to avoid misunderstandings and confusion.

And where possible, translations must minimize the “flickering” that occurs when you have to correct previously translated text because the meaning changed. The lower the flickering rate, the easier it is for someone to follow a conversation in a natural way.
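One way to make the flickering rate concrete is to count how many already-displayed words a later caption update has to rewrite. The sketch below is an illustrative metric only, not DeepL’s internal measure; the function names and the sample captions are assumptions made for the example.

```python
def common_prefix_len(a, b):
    """Number of leading words that two word lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def flicker_rate(updates):
    """Fraction of displayed words that a later update had to rewrite.

    `updates` is the sequence of caption strings shown for one sentence,
    in the order they appeared on screen (a hypothetical input format).
    """
    shown = 0       # words newly put on screen across all updates
    rewritten = 0   # previously shown words that were replaced
    prev = []
    for text in updates:
        words = text.split()
        keep = common_prefix_len(prev, words)
        rewritten += len(prev) - keep
        shown += len(words) - keep
        prev = words
    return rewritten / shown if shown else 0.0


# A caption that only grows never flickers.
stable = flicker_rate(["Ich fand es", "Ich fand es frustrierend"])

# A caption that has to be revised after the first word does.
revised = flicker_rate(["Ich habe es gefunden", "Ich fand es frustrierend"])
```

Under this definition, a lower value means the reader rarely has to re-read text that has already been displayed.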

How spoken language differs from written language—and why it matters for translation

To translate live speech accurately, it’s important to understand the many differences between the patterns of written language and the rhythms of speech. 

For instance, the way people speak is far more individual and less consistent than the way that they write. They employ distinct turns of phrase and colloquialisms that could stem both from regional dialects and from their particular personality or self-image. 

In addition, people construct and correct sentences as they’re speaking, leading to disfluencies where a false start or grammatically incorrect term is instantly followed by a corrected one. Reproducing these literally in translation isn’t helpful to someone trying to understand the meaning.

Throughout conversations, people also utter short affirmations (such as “uh-huh”) to reassure speakers that they understand or agree with what they’re saying. These help the flow of the conversation itself. However, they clutter translations for people trying to follow in another language. 

It’s helpful to filter these elements of spoken language out of a translation.
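As a rough illustration of this kind of filtering, the sketch below drops filler words, collapses immediately repeated words, and discards utterances that consist only of short affirmations. The word lists and the function name are assumptions chosen for the example; a production system would rely on far richer, language-specific models rather than fixed lists.

```python
import re

# Hypothetical, English-only word lists for illustration.
FILLERS = {"um", "uh", "er", "erm"}
AFFIRMATIONS = {"uh-huh", "mm-hmm", "mhm", "yeah", "right", "okay"}


def clean_utterance(text):
    """Return the utterance with simple disfluencies removed.

    Returns an empty string when the utterance is only an affirmation,
    signalling that nothing needs to be translated.
    """
    words = re.findall(r"[\w'-]+", text)
    if words and all(w.lower() in AFFIRMATIONS for w in words):
        return ""
    kept = []
    for w in words:
        if w.lower() in FILLERS:
            continue  # drop hesitation sounds
        if kept and kept[-1].lower() == w.lower():
            continue  # collapse an immediate repetition ("the the")
        kept.append(w)
    return " ".join(kept)


cleaned = clean_utterance("Um, I I think the the report is ready")
affirmation_only = clean_utterance("Uh-huh.")
```

Note that this simple version also strips punctuation and would wrongly collapse intentional repetitions, which is one reason rule-based filtering alone is not enough.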

Take control of terminology and nuance with DeepL Glossaries.

How DeepL Voice optimizes for accuracy and speed simultaneously

The challenge gets even more interesting when you consider that a real-time translation platform isn’t translating complete sentences. It needs to translate a sentence while someone’s speaking it, when that sentence’s final meaning isn’t yet clear. 

This requires us to optimize translations in a slightly different way. We don’t just want the most accurate translation. We want an accurate translation that’s flexible enough to incorporate new information that might change the direction of what people are saying.

Here’s an example. 

Imagine we’re translating a virtual meeting in which one of the participants is speaking English, and another is following what they’re saying with captions in German.

Our English speaker interrupts the conversation to say, “I found it.” Now, if we assume this is a complete sentence, the best possible German translation would be, “Ich habe es gefunden.” 

However, because this is live speech, we can’t be certain if the sentence is complete or not.

In this case, a better option could be to use a translation like “Ich fand es” instead. Why? If the English speaker goes on to say, “I found it frustrating,” the “Ich fand es” translation is perfectly positioned to simply add the word “frustrierend.”

If the first three words were translated as “Ich habe es gefunden,” then we’d need to revise the entire translation. 

That’s the type of major “flicker” that gets in the way of intuitively following a conversation. And that’s the type of problem DeepL aims to minimize wherever possible.
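The reasoning in this example can be sketched as a small selection rule: among candidate translations of the partial sentence, prefer the one that remains a valid prefix for the most plausible completions. This is a toy illustration under stated assumptions, not a description of how DeepL Voice is implemented. The candidate and completion lists are hand-written here, whereas a real system would not enumerate them explicitly.

```python
def is_prefix(partial, full):
    """True if `partial` matches the leading words of `full`."""
    p, f = partial.split(), full.split()
    return f[:len(p)] == p


def pick_flexible(candidates, possible_completions):
    """Pick the candidate that can be extended, without rewriting,
    into the largest number of possible completed translations."""
    return max(
        candidates,
        key=lambda c: sum(is_prefix(c, t) for t in possible_completions),
    )


# Candidate German captions for the partial input "I found it".
candidates = ["Ich habe es gefunden", "Ich fand es"]

# Hypothetical German translations of ways the sentence might end:
# "I found it.", "I found it frustrating.", "I found it interesting."
completions = [
    "Ich habe es gefunden",
    "Ich fand es frustrierend",
    "Ich fand es interessant",
]

choice = pick_flexible(candidates, completions)
```

Here “Ich fand es” is chosen because it can absorb two of the three continuations without any already-displayed word changing.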

How verb placement across languages shapes real-time translation

Accurate, real-time speech translation involves a wide range of such contextual judgments that are best made when human expertise guides technology. That expertise includes insights into where different languages are likely to position the verbs crucial to a sentence’s meaning. 

If they come early in the sentence (as in French and Spanish), it’s possible to display a translation more quickly than when they come at the end.

All of this helps a system pause just long enough to be accurate, but not so long as to delay understanding unnecessarily.
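To show how verb position could feed into that pause, here is a minimal sketch of a display rule that waits for more words when the source language tends to place the verb late. Everything in it is an assumption for illustration: the thresholds are invented, the language table is deliberately tiny (Japanese is included only as a familiar verb-final example), and the function name is hypothetical.

```python
# Hypothetical thresholds: how many source words to hear before
# showing a first translation, depending on typical verb position.
MIN_WORDS_BEFORE_DISPLAY = {"verb_early": 3, "verb_late": 6}

# Illustrative mapping only; real languages vary by clause type.
VERB_POSITION = {"fr": "verb_early", "es": "verb_early", "ja": "verb_late"}


def should_display(source_lang, words_heard, speaker_paused):
    """Decide whether to show a translation for the words heard so far."""
    if speaker_paused:
        return True  # a pause suggests the thought is complete
    # Unknown languages fall back to the more cautious setting.
    position = VERB_POSITION.get(source_lang, "verb_late")
    return words_heard >= MIN_WORDS_BEFORE_DISPLAY[position]


french_ready = should_display("fr", 3, speaker_paused=False)
japanese_ready = should_display("ja", 3, speaker_paused=False)
```

The design choice is to trade a slightly longer wait for fewer revisions in languages where the key verb is likely to arrive late.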

Read success stories from DeepL clients: the enterprises who trust our Language AI.

Why human linguistics expertise makes DeepL Voice more accurate

This combination of human linguistics expertise with highly accurate translation is allowing DeepL Voice to make a big difference to the experience of meetings and conversations for international businesses. 

These businesses include Japanese multinational information technology and electronics enterprise NEC Corporation. It became the first company to fully deploy DeepL Voice just a few weeks after its launch. 

The excitement around DeepL Voice reflects the fact that this is a groundbreaking moment for speech translation. The ability to decode and translate what people are saying while they’re saying it multiplies the value we can create for international businesses.

It transforms the way that teams can collaborate, builds stronger relationships, and ensures different ideas and perspectives are always included. 

The advances we’ve made so far are already making a major difference to the way organizations operate. There’s much more to come!

Learn why 96% of professional linguists chose DeepL Voice as the superior AI caption translator.

Experience the future of multilingual communication with DeepL

DeepL Voice delivers real-time translated captions to meetings and conversations, so every participant is heard and understood.

Translator and Write extend that precision to every workflow, giving global teams the tools to communicate clearly and accurately. API and Integrations move that power into your daily tools, platforms, and workflows.

Contact Sales to see how DeepL’s specialized Language AI suite can drive global communication across your enterprise.
