DeepL AI Labs

Unlocking high-quality translation for scanned documents and image-based content

Why is document translation so hard? 

This summer, one of our teams took on a challenge that has plagued global companies for years: translating formatted documents while preserving the exact layout and style of the original.

Think of the wide range of documents you encounter every day: corporate brochures, government guidelines, legal contracts, research reports and more. On the surface, this might sound like an easy task. After all, today’s powerful AI platforms can write code, shape business strategies and reason through complex scientific issues, so they should be able to translate a legal document and reproduce it to look just like the original.

But the complexity of this task becomes much clearer when you consider a few examples:

  • Word length: Imagine translating a corporate brochure from English to Japanese. The differences in character length create complex challenges in text wrapping around images, page breaks and column layouts.
  • Text styles and font sizes: Formatted documents often include a wide range of layouts, styles and sizes – large bold headlines, italicized quotes, tables, symbols and more. These aren’t just for readability; they are deliberate design choices that reflect brand style and the intent of the original document.
  • Multi-modal content: Another challenge is translating text within images — like diagrams and illustrations — across a wide range of styles and formats.
  • Scanned documents: In the case of scanned files, everything — including the text — is essentially an image. To make matters more challenging, scanned images are rarely perfectly aligned and often include variations in paper textures and backgrounds. This creates even greater challenges for translation accuracy and pixel-perfect layout.

The current approach is not working

Historically, document translation has relied on extracting text from the XML within a DOCX file, translating it while retaining the markup, and then re-inserting the translated text. For DOCX files, this approach works well, since the structured format allows the text to be changed while leaving the original layout intact.
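
To make that flow concrete, here is a minimal sketch of the extract-translate-reinsert loop, assuming the python-docx library and a placeholder translate() function; neither is DeepL’s actual pipeline, and real documents need more than run-by-run replacement.

    # Minimal sketch of the legacy extract-translate-reinsert flow for DOCX files.
    # python-docx is an illustrative choice; translate() is a hypothetical placeholder.
    from docx import Document

    def translate(text: str, target_lang: str) -> str:
        # Stand-in for a real machine-translation call.
        return text

    def translate_docx(in_path: str, out_path: str, target_lang: str) -> None:
        doc = Document(in_path)
        for paragraph in doc.paragraphs:
            for run in paragraph.runs:
                # Each run keeps its own formatting (bold, italic, font), so
                # replacing run.text leaves the surrounding markup intact.
                if run.text.strip():
                    run.text = translate(run.text, target_lang)
        doc.save(out_path)

    translate_docx("report_en.docx", "report_ja.docx", "ja")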

But for other documents, like scanned PDF files, the process is less reliable. Converting and extracting text, and then translating and reinserting it, often leads to imperfect images, misaligned text, and mismatched fonts.

DeepL’s breakthrough: From translation to reconstruction

After a few intense months of reimagining how to solve this problem, the DeepL team came up with a fundamentally different approach, which we’ve termed "reconstruction". 

Rather than merely preserving the existing document structure, this new method analyzes the layout, gathers and stores detailed information about it, and then uses that data alongside the extracted text to completely reconstruct the document — effectively discarding the original file.

This paradigm shift represents not only a significant technological leap, but also creates new opportunities for how documents can be processed and delivered.

How does document reconstruction work?   

The first step in making this work is to convert every page into an image. These images are then analyzed using advanced Vision Language Models (VLMs). Unlike traditional Optical Character Recognition (OCR) methods, VLMs don’t just identify individual characters in isolation – they understand the broader context of the document, much like how humans read. When you encounter a smudged word in a faded contract or a partially legible entry in a scanned table, you can often figure out what it says by understanding the surrounding text and the document’s structure. VLMs work similarly, using contextual clues to achieve higher accuracy in extracting text, especially when image quality is low or the document layout is complex.
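
As a rough sketch of this first stage, the snippet below rasterizes each page and hands it to a placeholder VLM call. PyMuPDF and the extract_with_vlm() function are assumptions made for illustration, not DeepL’s actual stack.

    # Sketch of the first stage: rasterize every page, then hand each image to a
    # vision-language model for context-aware extraction. PyMuPDF (fitz) and
    # extract_with_vlm() are illustrative assumptions, not DeepL's actual pipeline.
    import fitz  # PyMuPDF

    def extract_with_vlm(png_bytes: bytes) -> list[dict]:
        # Hypothetical call to a VLM that returns text segments plus layout
        # metadata (bounding boxes, styling hints) for one page image.
        raise NotImplementedError

    def pages_to_segments(pdf_path: str, dpi: int = 200) -> list[list[dict]]:
        segments_per_page = []
        with fitz.open(pdf_path) as doc:
            for page in doc:
                # Rendering to a bitmap puts scanned and born-digital pages
                # through the exact same pipeline.
                pixmap = page.get_pixmap(dpi=dpi)
                segments_per_page.append(extract_with_vlm(pixmap.tobytes("png")))
        return segments_per_page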

This contextual understanding translates into concrete benefits for businesses: fewer manual corrections after processing, more reliable data extraction from challenging documents like aged contracts or low-resolution scans, and significantly better performance on structured data like tables and forms where traditional OCR often struggles with individual cells. Only when the content is reliably understood can its translation produce dependable output.

This approach captures not only the text, but also information like the bounding boxes for each text block, details about background images and other layout heuristics. Once the text is translated, all this rich data – the translation coupled with the information about how the document was laid out – is fed into a powerful rendering engine.
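
One plausible shape for that per-element record is sketched below; the fields shown are assumptions for illustration, not DeepL’s internal schema.

    # Hypothetical structure for the layout data captured alongside the text.
    from dataclasses import dataclass, field

    @dataclass
    class LayoutElement:
        source_text: str                          # text as read by the VLM
        bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page
        font_size: float                          # estimated original size
        is_bold: bool = False
        is_italic: bool = False
        role: str = "body"                        # e.g. "heading", "table_cell"
        translated_text: str | None = None        # filled in after translation

    @dataclass
    class PageLayout:
        page_number: int
        width: float
        height: float
        background: bytes | None = None           # e.g. scanned page minus the text
        elements: list[LayoutElement] = field(default_factory=list)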

A new engine for a new kind of job  

One of the biggest challenges we had to overcome was building a new rendering engine. Documents span a wide range of formats, from simple flowing text on a white background, such as letters, to complex tables in financial reports, to figures and charts with labels in research papers, to intricate graphical layouts in colorful brochures.

After the text has been translated, all these different components need to be reconstructed in the new language as faithfully as possible. DeepL uses a set of technologies to reassemble the previously extracted information and adapt the layout to fit the translated text. An important aspect of this is adjusting font sizes to account for the different lengths of the original and translated texts. In the final step, the engine compiles all pages into a new PDF and delivers it to the user instantaneously.
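
As a toy illustration of the font-size adjustment, the sketch below shrinks a font until the translated string fits the width of the original bounding box. ReportLab’s font metrics and the single-line width check are simplifying assumptions; real layout also has to handle wrapping and vertical space.

    # Toy sketch of the font-fitting step: step the size down until the translated
    # string fits the width of the original bounding box. ReportLab's metrics are
    # an illustrative choice, not a description of DeepL's rendering engine.
    from reportlab.pdfbase.pdfmetrics import stringWidth

    def fit_font_size(text: str, box_width: float, start_size: float,
                      font_name: str = "Helvetica", min_size: float = 6.0) -> float:
        size = start_size
        # Shrink gradually, but stop at a floor below which text becomes unreadable.
        while size > min_size and stringWidth(text, font_name, size) > box_width:
            size -= 0.5
        return size

    # Example: an English translation that came out longer than its Japanese source.
    new_size = fit_font_size("Quarterly revenue grew 12% year over year",
                             box_width=180.0, start_size=12.0)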

Unlocking multi-modal translation with AI

This "reconstruction" approach is foundational to DeepL's expansion into multi-modal content — covering content in different formats, including plain text, audio, images, video, and interactive elements – enabling the translation of a broader range of content beyond purely text-based files. By converting documents into images and then using VLMs to extract the content beyond simple text detection, including comprehensive layout information, DeepL now enables accurate, high-quality translations of a much wider range of scanned documents and images — formats previously challenging to process – while preserving visual integrity. The modularity of these steps also opens up exciting possibilities for creating documents from other sources entirely.

The VLM project represents a pivotal advancement in DeepL's document translation capabilities. By embracing the "reconstruction" approach and leveraging cutting-edge VLM and OCR technologies, we are not only enabling the accurate translation of visually complex documents — like images and scanned PDFs — but also laying the groundwork for highly customizable, workflow-driven solutions. This initiative underscores DeepL's commitment to pushing the boundaries of Language AI, ensuring that our users have access to the most versatile and powerful translation tools available, and paving the way for new applications and deeper integration across diverse professional workflows.

