Inside DeepL’s Journey to Improving Document Translation

Table of contents
- The challenges of recreating translated PDFs
- Developing a quality metric for document translation
- The issues with pixel-based document comparisons
- The Average Bounding Box Overlap Ratio: a better-quality signal
- Choosing libraries for a document translation workflow
- Designing an algorithm to improve document quality score
- Setting a hierarchy of constraints for document translation
- The balance between algorithm and iteration
- About the Authors
Document translation is a valued tool for DeepL users. That’s why our team set itself the task of pushing the quality of translated documents that our platform can deliver. We wanted to set a new standard that put us near the top for quality, no matter what type of document people are translating. In this post, we share some of the considerations and constraints involved in creating a better document translation solution. To users, it seems like a simple process. In reality, it’s a complex one.
For our team, the actual translation of text is the simplest aspect of the document translation workflow. We are DeepL. We have an LLM for that. We can trust the quality of the translation in our document translation, and we constantly improve that quality.
The challenge comes from the document-related aspects of the workflow that sit around that translation. It’s these that determine the quality of the translated document in the eyes of our users.
Put simply, our task involves extracting language from the document ready for translation, and then recreating the document with the translated text in a way that is as close to the original as possible. It’s of limited value having a document perfectly translated if the translated text is in the wrong places, on the wrong pages, loses its flow or loses its connection to the design. If you’ve been using different document translation services over the years, you’ll know how big a challenge this is – and how often translated documents fall short of it.
The challenges of recreating translated PDFs
When we think about documents for translation, our primary focus is PDFs. That’s partly because PDFs represent the majority of the documents that people translate through DeepL. More than 60% of all documents uploaded to our platform are in this format. It’s also because PDFs represent the most challenging form of document for a translation task:
- Their quality varies hugely. Some are rough scans of paper documents. Others are digitally created, with selectable and searchable text objects that are easy to read and convert.
- They cover a vast range of document types and layouts, from exported Word or PowerPoint documents through to complex designs created using specialist software like InDesign or Photoshop.
- They reduce all of the elements in document layouts (design, text and variables such as font and font size) down to an image, from which it can be difficult to extract information.
In short, if you were designing a document format that was suitable for translation, then you’d design something pretty much the opposite of a PDF. We knew that, if we could transform PDFs into translatable text and then transform the text back in a way that successfully replicated the original, we would have cracked the formula for high-quality document translation. It would then be fairly straightforward to adapt the process for more editable file formats like Word, PowerPoint and HTML.
Developing a quality metric for document translation
To begin with, we needed to develop a measure of quality for document translation that we could optimize our approach around.
Our first attempt at this was to use a Structural Similarity Index Metric (SSIM). This treats the original and translated documents as two images and runs a pixel-based comparison of them. It generates a value based on the brightness, contrast and structure of their pixels, related to their neighboring pixels.
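As a rough illustration of the idea (not DeepL's actual implementation), the global form of SSIM compares two grayscale images through their mean brightness, contrast (variance) and structure (covariance). The sketch below works on flat lists of pixel intensities; real SSIM is computed over local sliding windows.

```python
# Minimal global SSIM sketch. Inputs are two equal-length flat lists of
# grayscale pixel intensities in [0, 255]. This illustrates the
# brightness/contrast/structure comparison; production SSIM averages
# the same formula over local windows of the image.

def ssim_global(x, y, dynamic_range=255):
    assert len(x) == len(y) and x, "images must be non-empty and equal-sized"
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from the
    c2 = (0.03 * dynamic_range) ** 2  # standard SSIM definition
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((p - mu_x) * (p - mu_x) for p in x) / n
    var_y = sum((q - mu_y) * (q - mu_y) for q in y) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical images score 1.0; any pixel movement pulls the score down, regardless of whether the underlying change matters to a reader, which is exactly the weakness discussed next.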
The issues with pixel-based document comparisons
There’s a problem with this approach. Pixel-based comparisons, without any sense of context, generate a lot of noise. This obscures the clear signal of quality that we’re looking for. There were two issues, in particular.
- Firstly, SSIM doesn’t discriminate between minor changes, like words dropping down to the next line of text, and major changes, like sentences dropping off the page or columns moving position. If pixels move, it’s treated as a change.
- Secondly, the pixel-based approach doesn’t discriminate between text and images. If a document has large background images, they have a disproportionate impact on the score, since they represent a lot of pixels that don’t change position. They drag the similarity score up even when other elements on the page are changing significantly.
It’s not that SSIM fails entirely as a quality metric. It does detect change. However, these issues mean that the score deviates only slightly even when significant changes occur, and within that slight deviation sit changes that aren’t actually significant. There’s too much noise to sift through for it to serve as a reliable quality score.
The Average Bounding Box Overlap Ratio: a better-quality signal
We therefore set about developing our own document translation quality metric, which we call the Average Bounding Box Overlap Ratio. Here’s what we did to generate this:
- We used the open-parse Python library to segment each document page into semantic areas, or bounding boxes, representing elements such as images, headings, body copy and captions.
- We compared the size and position of these segments in the original and translated documents, and calculated the degree of overlap. Within a PDF page, we would aim for all of these elements to occupy the same positions, which generates the highest score.
- We excluded areas of the document that are not classified as any semantic area (background elements, which are primarily white space), in order to reduce the noisiness of the metric further.
- We then calculated an average score across all of the pages in the document.
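The steps above can be sketched in a few lines. The function names and the assumption that boxes are already paired between the original and translated pages are simplifications for illustration; in practice the semantic segments come from a library like open-parse.

```python
# Sketch of an average bounding box overlap ratio. Boxes are
# (x0, y0, x1, y1) tuples for matched semantic areas; unclassified
# background regions are simply never passed in, which mirrors the
# exclusion of white space described above.

def overlap_ratio(a, b):
    """Intersection area divided by union area of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def average_overlap(pages_original, pages_translated):
    """Average the per-element overlap across all pages of a document."""
    scores = []
    for orig_page, trans_page in zip(pages_original, pages_translated):
        for orig_box, trans_box in zip(orig_page, trans_page):
            scores.append(overlap_ratio(orig_box, trans_box))
    return sum(scores) / len(scores) if scores else 0.0
```

When every element occupies exactly the same position in both documents, the ratio is 1.0; elements that shift or resize pull it down in proportion to how far they move.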
The end result? We could demonstrate that our Average Bounding Box Overlap Ratio gives us a clear signal on how well we preserve the layout of the translated document. We combined this with our manual assessments of nine different metrics to generate a combined quality score, which allows us to assess the layout quality of translated documents and choose further improvements without causing regressions.
We also support our ratio score with other, general, PDF-wide metrics. For example, we check that the number of pages, images and tables in a document all remain the same. If any of these counts has changed, then this raises a separate quality flag.
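These document-wide sanity checks reduce to comparing a handful of counts. A minimal sketch (the count extraction itself, e.g. from the parsed PDF, is assumed to happen elsewhere):

```python
# Sketch of the PDF-wide structural checks: compare page, image and
# table counts between original and translated documents, and report
# which counts changed so a quality flag can be raised.

def structural_flags(original_counts, translated_counts):
    """Return the names of document-wide counts that changed."""
    return [key for key in ("pages", "images", "tables")
            if original_counts.get(key) != translated_counts.get(key)]
```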
The next challenge involved designing a workflow that could work with the various constraints and dependencies involved in interpreting and recreating a PDF.
Choosing libraries for a document translation workflow
The first dependency is the software tool used to convert the text images within a PDF into machine-readable text for translation. We depend on the Optical Character Recognition (OCR) library that we choose for this task for two important things:
- Capturing the original text accurately, in order for us to translate it.
- Correctly identifying the fonts and font sizes that are used in the original, in order for us to recreate these in the translated document.
The features within DeepL Translate actually provide a useful safety net here. Minor errors in how the text is captured can result in sentences that don’t quite make sense but aren’t wholly inaccurate. DeepL Translate is capable of capturing intended meaning in many of these cases, and still producing an accurate translation.
Starting out with the wrong fonts and sizes can distort the appearance of text when we recreate a document. However, the way that our system manipulates translated text to get the best possible result de-risks this to some extent. Even if the wrong font is identified, it should be similar to the original, and our approach ensures that it should be the optimal size for each piece of text.
We can also monitor the overall performance of our chosen OCR libraries by using our Average Bounding Box Overlap Ratio to pick up on any major distortions that result from fonts being identified incorrectly. We use qualitative feedback on sample PDFs to monitor this as well.
The second major dependency comes at the other end of the workflow. In order to translate the text that we’ve extracted from a PDF, we convert it into DOCX format. We then need a DOCX library that can capture and store the features of the original PDF in this intermediary format, and enable us to manipulate the translated text within those features. This means that, when we’re working with the DOCX, we’re able to take account of where page breaks, columns and headings should be. We also need a library that can turn the DOCX file back into the PDF at the end of the process.
Fortunately, we found that the most suitable library for storing document features and manipulating the translated text is also the most suitable library for converting the DOCX into PDF. This de-risks the end of our workflow. We don’t have to worry about libraries treating the DOCX in different ways, because the libraries are the same.
Designing an algorithm to improve document quality score
Once we had selected the libraries for both ends of the workflow, we could move on to what is, from our point of view, the business end of document translation. This involves developing an algorithmic approach to manipulating the translated text, so that it scores as highly as possible on our combined quality scoring system.
The big challenge here is, of course, language expansion and contraction. The length of sentences and paragraphs, in terms of number of characters, varies significantly from one language to another. Add in translations that move from one character set to another, such as from German to Chinese, or change the direction of text, such as from English to Arabic, and the parameters shift even more. We’re trying to fit text that is longer, shorter or a different shape into the same space.
In order to do this effectively, we need to manipulate the text. The question is: which changes will have the least impact on the translated document that we end up producing?
Setting a hierarchy of constraints for document translation
The exercise of developing our quality metric had a useful additional output. It helped us to establish a hierarchy of priorities. We could determine which were the biggest factors contributing to the final document experience and balance our model accordingly.
We knew that we needed to transform our translated text back into the document format in a way that respected the following constraints, in the following order of priority:
- Text must remain on the same pages as in the original document.
- Text and images must remain in the same places on the page.
- The ratio of font size between different elements of text (headlines, body copy, captions, etc.) must stay as close as possible to the ratios in the original.
- To the greatest extent possible, font sizes must remain consistent throughout the document.
Why this order of priorities? We know from customer feedback that text escaping from one page to another has the greatest negative impact on the document experience. It also has the greatest knock-on impact on the overall quality score for a document in our Average Bounding Box Overlap Ratio metric. It compromises text and image positions on all subsequent pages and alters the overall number of pages in the document. Moving elements and images around on the page also has a big impact on the ratio. As an added factor, it places greater demands on the library that we depend on to recreate the original PDF layout.
From the human review element in our combined quality score, we know that people respond more to the relative size of copy elements on a page, than they do to the consistency of font size throughout the entire document. They’ll notice if you keep changing the font size on every page, but nothing like as much as they’ll notice if headlines and body copy are suddenly the same size. We established a window by which the ratio between headings and body copy could vary without losing the right balance between different elements.
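That window can be expressed as a simple tolerance check on the heading-to-body ratio. The 10% tolerance below is a made-up value for illustration, not DeepL's actual threshold:

```python
# Sketch of the ratio-window idea: the heading/body font-size ratio in
# the translated document may drift from the original ratio, but only
# within an allowed relative tolerance (hypothetical 10% default).

def ratio_within_window(orig_heading, orig_body,
                        new_heading, new_body, tolerance=0.10):
    """True if the new heading/body ratio stays close to the original."""
    original_ratio = orig_heading / orig_body
    new_ratio = new_heading / new_body
    return abs(new_ratio - original_ratio) / original_ratio <= tolerance
```

Scaling a 24pt heading and 12pt body down to 22pt and 11pt preserves the 2:1 ratio exactly, so it passes; flattening both to 12pt collapses the ratio to 1:1 and fails.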
The balance between algorithm and iteration
Establishing this hierarchy of priorities provided the foundation for an algorithm to identify the right way to deal with translated text.
It’s tempting to assume that the right mathematical formula is all that’s needed here. If we know the average language expansion or contraction, we can feed in the numbers and calculate an optimum percentage adjustment to font size, that will keep text fitting within the same spaces it originally occupied. For our first experiment, that’s what we did: mapping the scaling factors for different languages.
The problem we found is that the ideal, mathematical solution doesn’t really exist in practice. Font sizes are only available in whole or half points. There is no such thing as a 10.385-point version of Times New Roman. When your calculation generates a number with several decimal places, you’re forced to round up or down, and the overall result falls short of the ideal.
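Concretely, the purely arithmetic approach looks something like this. The expansion factors shown are illustrative, not measured values:

```python
# Sketch of the first, formula-only approach: divide the original font
# size by a per-language-pair expansion factor (longer translated text
# needs a smaller font), then snap to the nearest half point, since
# fonts only render at whole- or half-point sizes.

EXPANSION = {"en->de": 1.15, "en->zh": 0.70}  # hypothetical factors

def scaled_font_size(original_size, language_pair):
    ideal = original_size / EXPANSION[language_pair]
    return round(ideal * 2) / 2  # snap to the nearest half point
```

For a 12pt original translated into German, the ideal size is roughly 10.43pt, which snaps to 10.5pt; the rounding error is exactly the shortfall described above.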
This effect is amplified because the size that works for one segment of text might not work so well for another. Different column widths influence how words move between line breaks, for example. A font size that keeps text within one boundary can lead to it overflowing another.
For these reasons, we found that a model using a straightforward algorithm for resizing text struggles to improve beyond a mediocre quality score. So, we experimented again. We used the algorithm to calculate a likely starting font size, but then iterated for each section of text to find the ideal size for that particular section.
We found that we can keep adjusting each element of text up and down through half-point sizes until it fits as well as possible within its boundary. Because each text element started off with a calculated font size that was in proportion to the other text elements, we’re able to maintain the different ratios between headings and body copy. It may not be the exact ratio we started with, but it’s within the windows that we allow ourselves.
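A minimal sketch of that refinement loop, assuming a `fits` predicate that stands in for the real layout measurement of whether text at a given size stays inside its bounding box:

```python
# Sketch of the iterative step: starting from the calculated size, step
# a text element's font size up or down in half-point increments until
# it fits its boundary as tightly as possible. `fits(size)` is a
# stand-in for an actual text-layout measurement.

def fit_font_size(start_size, fits, min_size=6.0, max_size=72.0):
    """Largest half-point size within bounds at which the text fits."""
    size = start_size
    # Grow while there is room to spare...
    while size + 0.5 <= max_size and fits(size + 0.5):
        size += 0.5
    # ...or shrink until the text no longer overflows.
    while size > min_size and not fits(size):
        size -= 0.5
    return size
```

With a crude width model where text occupies `size * 80` units and the box is 900 units wide, a 10pt starting guess grows to 11pt and an oversized 12pt guess shrinks back to the same 11pt, converging on the tightest fit from either direction.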
In this way, we’re able to protect the integrity of each text boundary that our ratio score measures, and therefore protect the integrity of each page in the document. We’re able to create a document with the same number of pages, text fitting within the same boundaries on the page, and the layout of each page remaining fundamentally the same. In testing, we can sense-check this through our Average Bounding Box Overlap Ratio and manual assessment, and confirm that we’re delivering the target quality that we have set ourselves.
Of course, that’s not the end of the process. Now that we’ve got a measure of quality that works, we can keep experimenting and optimizing to improve our document output further. We’ll aim to keep producing better results across more languages, more character sets, more font styles and more document types.
About the Authors
Oleksandr Matiiasevych, Senior Product Manager at DeepL
Oleksandr leads document translation at DeepL, ensuring we tackle the most impactful opportunities in this field. With a background in computer science and years of product management experience across different industries, he has always focused on helping customers achieve their business goals.
https://www.linkedin.com/in/matijasevich/
Fabian Grewing, Senior Software Engineer at DeepL
Fabian Grewing has been working on and driving innovation in document translation at DeepL for over 4 years. He is always aiming at developing a reliable, scalable and straightforward product to provide value to our customers.
https://de.linkedin.com/in/fabian-grewing
Joshua Christl, Software Engineer at DeepL
Joshua Christl is a Software Engineer with a strong commitment to code quality. With a background in computer science and a focus on backend development, he seeks to build features that pragmatically address genuine customer needs while upholding high software quality standards.