How we built DeepL’s next-generation LLMs with FP8 for training and inference

Table of contents:
- What’s the difference between BF16 and FP8 formats?
- Pre-training for DeepL LLMs
- Applying the FP8 format for mixed precision training
- FP8 vs BF16 for training performance
- FP8 vs BF16 for LLM training quality
- FP8 vs BF16 downstream training quality
- From FP8 training to inference
- The benefits of FP8 for inference
- About the Authors
When DeepL deployed our current NVIDIA DGX SuperPOD with 544 NVIDIA H100 Tensor Core GPUs, we didn’t just get a big increase in compute power. The H100 also introduces native support for 8-bit floating point (FP8) data types, through a new generation of Tensor Cores, which enable the GPU to perform matrix multiplications and other tensor operations at FP8 precision. By computing matrix multiplications in FP8, we can increase throughput when training and deploying our Large Language Models (LLMs), since most of the compute involved in modern LLMs takes the form of matrix multiplications.
Moving computations from 16-bit to 8-bit precision has had a major impact on the development of DeepL’s next-generation LLMs. It enables us to build much larger Language AI models, with far more parameters, that deliver significant improvements in quality as judged by language experts, while staying within the same latency window for production inference. This means translations that outperform our previous models by 1.4x for European languages and 1.7x for more complex language pairs such as English and Japanese, delivered effectively just as quickly. It also means that our models can handle a greater number of requests across more features and functions, without compromising user experience.
In other words, FP8 training and inference have played a central role in scaling DeepL’s Language AI.
In this post, we explain the journey we took to apply FP8 for training and inference, share some of the tools and techniques that underpinned this success, and give you an idea of the results we achieved in training and inference performance along the way.
What’s the difference between BF16 and FP8 formats?
The simple explanation of the difference between BFloat16 (BF16) and FP8 is that the latter uses half as many bits to express values. BF16 spreads its 16 bits across two bytes of 8 bits each; FP8 uses just one byte.
The number of bits you have available dictates the precision with which you can deploy the mantissa and exponent elements of floating-point numbers in scientific notation. BF16 has 1 sign bit, 8 bits for the exponent and 7 bits for the mantissa. FP8 has half as many bits to play with: 1 sign bit and 7 bits divided between the exponent and mantissa. Because of this, it can represent both a narrower range of numbers and fewer numbers within that range.
As an example, let’s say we wanted to represent the age of the earth in billions of years (roughly 4.543). In BF16, we can represent this almost exactly as 0100000010010001, which decodes to about 4.53.
What about representing that number in FP8? There are actually two FP8 formats to choose from: E4M3 and E5M2. The letters and numbers here represent the way that bits are distributed between the exponent (E) and the mantissa (M). The more bits you devote to the exponent, the larger the range of numbers you can describe, and the more bits you devote to the mantissa, the more numbers that you can describe within that range.
In FP8, whichever format you choose, the rounding is much coarser. With E4M3 and its relatively greater precision, the closest you can get is 4.5. In E5M2, the closest you can get is 5.
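To make this concrete, here is a minimal sketch (not part of our training code) that reproduces the rounding above using PyTorch’s FP8 data types, which have been available since PyTorch 2.1:

```python
import torch

x = torch.tensor(4.543, dtype=torch.float32)

# Cast to each low-precision format, then back to float32 to inspect the rounded value.
bf16 = x.to(torch.bfloat16).to(torch.float32)        # ~4.53
e4m3 = x.to(torch.float8_e4m3fn).to(torch.float32)   # 4.5
e5m2 = x.to(torch.float8_e5m2).to(torch.float32)     # 5.0

print(f"BF16: {bf16.item()}, E4M3: {e4m3.item()}, E5M2: {e5m2.item()}")
```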
In exchange for this reduced range and precision, FP8 enables faster computation and requires significantly less memory than BF16. This is hugely valuable for inference, and with the right approach, it can be hugely valuable for accelerating training as well. It comes down to the question of how much range and precision LLM training actually needs. Do you need to represent the age of the earth precisely? Or is getting within 43 million years close enough? If you’re interested in the universe as a whole, you’d probably be happy with that second level of precision. The rounding error only represents about 0.3% of the age of the universe, after all.
DeepL’s journey proves that FP8 can deliver what’s required for high-quality LLM training, and this has unlocked new possibilities for what we train our models to do, and how we deploy them in practice.
Pre-training for DeepL LLMs
The journey we take with FP8 training and inference starts with the pre-training of our LLMs. After pre-training, we fine-tune our models on certain tasks, distil large models into smaller models, apply reinforcement learning, and use a set of parallelization strategies to make full use of the huge number of GPUs at our disposal.
Applying the FP8 format for mixed precision training
We transitioned our existing training code from BF16 to FP8 using NVIDIA Transformer Engine, a library that accelerates transformer models and includes support for FP8. Transformer Engine provides essential components that facilitate mixed-precision training, seamlessly managing the conversion between FP8 and BF16 formats and handling scaling factors.
We use Transformer Engine’s default setup, as recommended by NVIDIA: E4M3 in the forward pass and E5M2 in the backward pass. This means we use the format with higher precision to predict the probability distribution of the next token, and the format with lower precision but wider range to compute the gradients needed to update the model, where precision matters less. Each format is used for the task it is best suited to.
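As an illustration of what this looks like in code, here is a minimal sketch of Transformer Engine’s mixed-precision API, where the HYBRID recipe means E4M3 in the forward pass and E5M2 in the backward pass. The layer sizes and recipe hyperparameters below are common example values from the library’s documentation, not our production settings:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# Matrix multiplications inside this context run at FP8 precision;
# Transformer Engine manages the format conversions and scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()
```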
In the chart below, we’ve plotted all the numbers that can be represented with the E4M3 format. As you’ll see, those numbers are concentrated around zero, with a maximum value of less than 500. In fact, the full set of representable values for an FP8 format fits in a very short table. The trick to making this format work for training is to be aware of the distribution of those values and work within it.
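If you want to print that table for yourself, one way (purely illustrative, and again assuming PyTorch’s FP8 dtypes) is to reinterpret all 256 possible byte patterns as E4M3 values:

```python
import torch

# Reinterpret every possible byte (0..255) as an E4M3 value.
bits = torch.arange(256, dtype=torch.uint8)
vals = bits.view(torch.float8_e4m3fn).to(torch.float32)

finite = vals[torch.isfinite(vals)]   # drop the NaN encodings
print(finite.abs().max())             # 448.0, the largest finite E4M3 magnitude
print(torch.unique(finite))           # the full, sorted table of distinct finite values
```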
This involves storing additional scaling factors alongside the FP8 weight tensors to overcome the limited range and prevent overflow and underflow. When performing calculations with your low-precision tensors, you must also consider the scaling factors. For example, when multiplying two tensors, you compute (A_fp8 × A_scale) × (B_fp8 × B_scale), where A_fp8 and B_fp8 are 8-bit tensors and the scales are 32-bit scalars. There is specific hardware support for these operations.
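A minimal, hardware-agnostic sketch of that idea follows. The helper name and per-tensor scaling scheme are illustrative (Transformer Engine handles this for us in practice), and the FP8 matmul is emulated by upcasting so the snippet runs anywhere:

```python
import torch

def fp8_scaled_matmul(a_fp32: torch.Tensor, b_fp32: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: quantize to E4M3 with per-tensor scales,
    then apply the scales around the low-precision matmul."""
    FP8_MAX = 448.0  # largest finite magnitude in E4M3

    # Per-tensor scales so the largest magnitude maps onto the FP8 range.
    a_scale = a_fp32.abs().max() / FP8_MAX
    b_scale = b_fp32.abs().max() / FP8_MAX

    a_fp8 = (a_fp32 / a_scale).to(torch.float8_e4m3fn)
    b_fp8 = (b_fp32 / b_scale).to(torch.float8_e4m3fn)

    # On H100-class hardware this matmul runs on FP8 Tensor Cores;
    # here we upcast to float32 to emulate it on any device.
    return (a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)) * (a_scale * b_scale)

a = torch.randn(128, 256)
b = torch.randn(256, 64)
# The result matches the full-precision matmul up to a small quantization error.
print((fp8_scaled_matmul(a, b) - a @ b).abs().max())
```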

FP8 vs BF16 for training performance
When we talk about training performance, we are referring to how quickly a model can be trained given the compute power that’s available to it. To compare training performance between FP8 and BF16, we look at Model FLOPS utilization (MFU), which is the number of floating-point operations per second (FLOPS) that the model performs, expressed as a percentage of the FLOPS that are theoretically possible with the hardware available.
For our comparison, we used the number of FLOPS possible with the BF16 format as the common denominator, despite the fact that, technically speaking, more FLOPS become possible once you move to FP8. This enabled us to gauge the incremental gain in use of available processing power when moving from BF16 to FP8.
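In other words, MFU is simply the model FLOPS achieved divided by the hardware’s theoretical peak. A rough sketch of the arithmetic, using the published dense BF16 peak of an H100 SXM GPU as the fixed denominator (treat the exact peak figure as approximate):

```python
# Approximate published dense BF16 Tensor Core peak for an H100 SXM GPU.
H100_BF16_DENSE_PEAK_TFLOPS = 989.0

def mfu(achieved_model_tflops_per_gpu: float,
        peak_tflops_per_gpu: float = H100_BF16_DENSE_PEAK_TFLOPS) -> float:
    """Model FLOPS utilization: model FLOPS achieved as a fraction of the peak."""
    return achieved_model_tflops_per_gpu / peak_tflops_per_gpu

# Keeping the BF16 peak as the common denominator, going from 44.6% to 67% MFU
# corresponds to roughly a 1.5x speed-up in training throughput.
print(0.67 / 0.446)  # ~1.50
```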
As shown in the chart below, the efficiency with which our model training uses the compute power available increased from 44.6% MFU to 67% MFU with FP8, effectively accelerating our model training by 50%.

That’s an impressive performance gain in itself. To get there, we worked with NVIDIA to optimize our use of Transformer Engine features. In another training setup, we managed to incrementally increase training performance by 25% over the course of 15 months, taking us up to 80% MFU.

FP8 vs BF16 for LLM training quality
The FP8 gains in terms of training performance are therefore very impressive indeed. However, the output we really care about at DeepL is training quality. How does FP8 compare to BF16 on that front?
To check the quality that FP8 delivers, we trained one of our models in both formats. This enabled us to compare training losses and downstream quality.
We trained a 1.5B-parameter model on three trillion tokens and then compared the quality of the FP8 training vs BF16. The key measure here was training loss, which reflects how well the model predicts the next token.
As you’ll see from the chart below, we can detect a slight advantage for BF16 over FP8, with the FP8 line hovering just above the BF16 line. However, this difference is dwarfed by the much wider step-to-step fluctuations in training loss that occur for both formats, and in both cases we see the same tangible reduction in training loss over time.

FP8 vs BF16 downstream training quality
We then moved on to test the quality that training in FP8 vs BF16 delivered in a practical, downstream application.
In this case, we tested how the model performed when working with English and German. We compared validation perplexity, which quantifies the uncertainty a model experiences when predicting the next token in a sequence. Once again, the expectation is that perplexity decreases over time. In this practical scenario, we found no degradation in quality with FP8 training compared to BF16.
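For reference, validation perplexity is just the exponential of the average next-token cross-entropy on held-out data. A generic sketch (not our evaluation code) of how it can be computed:

```python
import torch
import torch.nn.functional as F

def validation_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean next-token cross-entropy); lower means less uncertainty.

    logits:  (batch, seq_len, vocab_size) model outputs on held-out text
    targets: (batch, seq_len) ground-truth next-token ids
    """
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss.exp().item()
```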

The net result of moving from BF16 to FP8 is that we are able to train our models faster, with reduced demand on memory, while achieving effectively the same training quality: only a minimal difference in training loss and comparable validation perplexity. In effect, this means that DeepL is able to build more sophisticated models, tackling more complex tasks, by maximizing the utilization of the processing power available. It significantly widens the scope of what we can do with LLM training.
From FP8 training to inference
The next stage in the journey involves preparing LLMs for production inference. Here the heavy lifting is carried out by NVIDIA TensorRT-LLM, NVIDIA’s solution for scalable LLM inference, which supports FP8. It takes the weights of your model from training and builds an engine that optimizes the model’s operations to be as compute-efficient as possible, using techniques such as kernel fusion, optimized C++/CUDA code, KV caching and continuous in-flight batching.
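For a flavor of what deployment can look like, here is a sketch using TensorRT-LLM’s high-level Python LLM API. Import paths and argument names vary between releases, and the checkpoint path and prompt are hypothetical; treat this as illustrative rather than a description of our actual serving stack:

```python
from tensorrt_llm import LLM, SamplingParams

# Hypothetical path to a model checkpoint prepared for FP8 inference.
llm = LLM(model="./my_fp8_checkpoint")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Translate to German: The weather is nice today."]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```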
The benefits of FP8 for inference
Inference for LLMs always involves the interaction of throughput (the number of tokens that can be processed in a timeframe) and latency. It goes without saying that delivering the best possible customer experience involves controlling latency. However, throughput also matters hugely to DeepL because it defines the number of requests that we can handle at a given time, and therefore the scope of what our model can do in practice.
Inevitably, as throughput increases, latency tends to increase as well. Batching multiple requests enables greater throughput, but at the cost of increased latency for each individual request, which can impact the customer experience. However, the inference performance of FP8 vs BF16 significantly changes this balancing act in our favor.
As the chart below shows, for most batch sizes, FP8 can handle double the throughput at the same latency as BF16. If we set a specific latency budget corresponding to the optimum experience for our users, we can see this in practice: FP8 has effectively doubled the throughput capacity of our LLMs.

In other words, the journey from BF16 to FP8 hasn’t just enabled us to build more powerful and sophisticated LLMs for DeepL. It’s also ensured that we are able to apply those LLMs effectively, to deliver optimum customer experiences and scale the impact of our Language AI in the wild. We get faster training of larger models, which can then operate within the same latency parameters, while handling double the number of requests.
What’s next? Well, just last week, DeepL deployed the new NVIDIA DGX SuperPOD with NVIDIA DGX GB200 systems, which delivers another huge leap in compute power. What’s really interesting for us is that this machine introduces a new generation of Tensor Cores that natively support FP4 tensor operations such as matrix multiplications. That’s when our journey begins again. It’s been exciting to see what we can do with a single byte when it comes to training and inference. Watch this space to see what’s possible with half a byte.
About the Authors
Markus Schnös, Staff Research HPC Engineer
Markus Schnös is a Staff Research HPC Engineer at DeepL, where he focuses on scaling LLM training and inference. He has a special interest in distributed training and low-precision floating point computation.
www.linkedin.com/in/markus-schnoes-349300185
Fabian Joswig, Staff Research HPC Engineer
Fabian Joswig is a Staff Research Engineer at DeepL, with a background in machine learning, high-performance computing and theoretical particle physics. He focuses on scaling AI models and infrastructure for the world's most accurate translator.