How do you deploy an NVIDIA DGX SuperPOD with DGX GB200 systems?

DeepL has just become the first organization in Europe to deploy the new NVIDIA DGX SuperPOD with DGX GB200 systems. We’re excited about it, and with good reason. Based on the NVIDIA Grace Blackwell architecture, DGX GB200 systems are a hugely impressive piece of technology, and they represent a monumental change in the compute power available to DeepL for training and inference.
However, the capabilities of the hardware itself are only part of the story. Deploying DGX GB200 means embracing the challenge of optimizing the way you train and run large language models (LLMs) so that you can actually exploit the significantly increased compute power. It’s not just a case of doing what you were already doing, but faster. It means re-imagining how you work with the technology you have in order to extract its full performance, and that has repercussions for software, model architecture, data and more. Deploying a DGX SuperPOD involves far more than unboxing it, connecting it, and turning it on.
DGX SuperPOD infrastructure and configuration
Let’s start with the infrastructure. The increase in compute power extends well beyond individual chip improvements to cross-GPU communication and the scaling this enables. NVIDIA NVLink provides a high-speed interconnect between GPUs, removing one of the most important constraints on performance and pooling several GPUs so that they act as if they were one. This is particularly significant when training models: a computationally intensive task in which the speed at which GPUs can communicate directly with each other influences the size of model you can train.
With our current DGX SuperPOD with DGX H100 systems, built on NVIDIA’s Hopper GPU architecture, we have islands of 8 GPUs connected to function as a single GPU. With the DGX GB200, the pooling factor increases to 72 GPUs, a factor of nine. Add in the fact that each of these GPUs itself has more video random access memory (VRAM), greater memory bandwidth, and can execute more floating-point operations per second (FLOPS), the other most important constraint when training models, and you quickly get a sense of the step-change involved.
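To make that step-change concrete, here is a minimal back-of-the-envelope sketch of how the NVLink domain size changes the pool of memory reachable over the fast interconnect. The domain sizes (8 and 72) come from the comparison above; the per-GPU memory figures are approximate placeholders for illustration, not official specifications.

```python
# Rough arithmetic: how much VRAM a single sharded model can reach over NVLink.
# Domain sizes are from the text; per-GPU memory values are assumed placeholders.

h100_domain_gpus, h100_vram_gb = 8, 80      # DGX H100 island: 8 GPUs, ~80 GB each
gb200_domain_gpus, gb200_vram_gb = 72, 190  # GB200 NVL72 domain: 72 GPUs, ~190 GB each (assumed)

h100_pool = h100_domain_gpus * h100_vram_gb
gb200_pool = gb200_domain_gpus * gb200_vram_gb

print(f"H100 NVLink domain:  {h100_pool:>6} GB pooled VRAM")
print(f"GB200 NVLink domain: {gb200_pool:>6} GB pooled VRAM")
print(f"Roughly {gb200_pool / h100_pool:.0f}x more memory reachable at NVLink speed")
```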
DGX GB200 SuperPOD performance
To give a sense of how much extra compute power we’re dealing with, I’ve put together a quick performance comparison between our current DGX SuperPOD with DGX H100 systems, ranked among the world’s top 500 supercomputers, which we named Mercury and deployed in 2023, and the new NVIDIA DGX SuperPOD with DGX GB200 systems, which we’ve given the catchy name of Arion (in case you were wondering, it’s named after an exoplanet in the constellation of Delphinus).
Mercury is fast enough to translate the entire Oxford English Dictionary in 39.06 seconds, which is impressive enough. With Arion, though, we could do it in 3.72 seconds. What about translating the entire internet? Mercury could do it in 193.76 days. Arion could manage it in 18.45 days. That’s a huge difference that turns something hypothetical into something conceivably practical.
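For readers who like the ratio spelled out, here is the speedup implied by those figures (all numbers taken directly from the comparison above):

```python
# Speedup implied by the quoted benchmark figures.
oed_mercury_s, oed_arion_s = 39.06, 3.72    # Oxford English Dictionary, seconds
web_mercury_d, web_arion_d = 193.76, 18.45  # entire internet, days

print(f"OED:      {oed_mercury_s / oed_arion_s:.1f}x faster on Arion")  # ~10.5x
print(f"Internet: {web_mercury_d / web_arion_d:.1f}x faster on Arion")  # ~10.5x
```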
That’s only the very start of the story, though. As I mentioned, Arion’s FLOPS performance is already superior to Mercury’s. However, there are adjustments we can make to extract further gains, and these have big implications for training in particular. This is where the software side of optimizing for the new machine’s capabilities comes in.
Optimizing floating-point formats for DGX GB200
We enabled 8-bit floating-point (FP8) training and inference on Mercury rather than relying on the 16-bit brain floating-point format (BF16) that most model training had used previously. We were able to do this successfully using NVIDIA’s Transformer Engine (for training) and TensorRT-LLM (for scalable LLM inference). By using fewer bits per number in our operations, we could perform more operations per second, with reduced precision but, crucially, no compromise in training quality. This increase in FLOPS greatly increased the size of models we could train. Because each value takes fewer bytes, it also made better use of our memory bandwidth, which enhanced inference performance as well, enabling greater throughput without an increase in latency.
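For context, here is a minimal sketch of what enabling FP8 looks like with the Transformer Engine PyTorch API (it follows the library’s published quickstart pattern and requires an FP8-capable GPU). The layer dimensions and recipe settings are illustrative placeholders, not DeepL’s actual training configuration:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Illustrative layer dimensions, not a real model configuration.
in_features, out_features, batch = 768, 3072, 2048

layer = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(batch, in_features, device="cuda")

# HYBRID uses E4M3 for forward-pass tensors and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Run the forward pass with FP8 GEMMs; the backward pass reuses the recipe.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.sum().backward()
```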
With the next-gen capabilities of Arion, we envisage being able to reduce bits per number further, from FP8 to FP4. This will deliver further gains in both training model size and inference. To use the same comparison as before: with the extra FLOPS and memory bandwidth possible through FP4, translating the Oxford English Dictionary would take just 2.07 seconds, and the entire internet just 10.25 days.
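Again, spelling out what those projected figures imply (all numbers are the ones quoted in this post):

```python
# Additional gain projected for FP4 on Arion, relative to FP8 and to Mercury.
fp8_oed_s, fp4_oed_s = 3.72, 2.07      # OED, seconds
fp8_web_d, fp4_web_d = 18.45, 10.25    # entire internet, days

print(f"FP8 -> FP4 on Arion: ~{fp8_oed_s / fp4_oed_s:.1f}x faster")      # ~1.8x
print(f"Mercury -> Arion+FP4 (OED): ~{39.06 / fp4_oed_s:.0f}x faster")   # ~19x
print(f"Mercury -> Arion+FP4 (web): ~{193.76 / fp4_web_d:.0f}x faster")  # ~19x
```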
Optimizing model architecture for DGX GB200
Advances in hardware are closely intertwined with the architecture of the models you train on that hardware. The relationship is less linear than cyclical. Rather than architectures becoming ever more complex (or ever simpler) as compute power advances, they tend to become more complex up to a point, and then revert to a simpler mean.
To see this pattern in action, let’s take a look back to when DeepL first launched, in 2017, or when I first joined in 2020. In those days, we were running fairly simple architectures that were super-optimized to run on the hardware we had at the time. In order to squeeze the best possible performance out of those models, we kept developing them into an ever-more complex and customized sequence-to-sequence architecture that was unique to DeepL. We developed all kinds of clever ways to improve the status quo, until we reached a ceiling. Adding complexity was no longer the most effective way to increase performance. To keep improving, we had to change the hardware, and with it the possibilities. We first did this by deploying Mercury. Now, we’re doing it again by deploying Arion.
The irony is that, when compute capacity suddenly grows in this way, the highly customized architecture you’ve developed to maximize what you can do with the old parameters of FLOPS and memory bandwidth no longer applies. Your local improvements and legacy architecture now risk getting in the way of larger models that no longer require them. A simpler architecture is much more capable of scaling, and much more capable of adjusting to the new possibilities that you’re dealing with.
This is often referred to as the bitter lesson of AI research. Human knowledge develops by becoming more complex, but AI advances most powerfully through simple, general methods that scale with the available compute. There’s an emotional wrench to effectively throwing out architectures and software that you’ve built for your organization, and in which you’ve invested your ingenuity and creativity. However, it’s essential to avoid painting yourself into a legacy corner.
Culturally, AI research depends on being ready to move on from what you currently have and work with what’s now available. It was when DeepL moved to Mercury that we moved from our highly bespoke original models to LLMs running on a much simpler architecture, which were eminently more scalable, and which ultimately enabled us to do far more.
The bitter lesson also teaches us that this pattern will repeat. Even as we’ve worked to develop our latest LLMs, we’ve known that we would at some stage be scaling up our compute capacity, moving to training bigger models, and leaving our optimizations behind. Deploying our DGX GB200 SuperPOD Arion will involve moving on from the local improvements we’ve made and reverting to a global optimum, in order to move forward faster.
Data-scaling strategies for DGX SuperPODs
Training LLMs depends on data, and training bigger and better models depends on access to more data than was previously available. However, the amount of data in the world isn’t infinite. If it runs out, you’ve hit a ceiling for training models and advancing the capabilities of AI. The answer to this conundrum is synthetic data: data that can be generated at scale, by using AI. The question that researchers increasingly ask is how viable such synthetic data really is for training. Does it become self-referencing and inauthentic? Does it eventually degrade quality?
DeepL has an advantage here, because, over the last five years, our approach to training has increasingly depended on a sophisticated means of generating synthetic data.
Because language is inherently subjective, there’s no real substitute for human input. It’s the ground truth for which forms of expression human beings prefer, which experiences of language they want to have, and which concepts guide their language choices. This is why training DeepL has always required us to bring human insight into our models, and yet find a way to scale that insight to generate more data than you could gather from humans alone.
We’ve done this by developing models that can extrapolate from expert linguistic insight and generate synthetic data, and also models that can effectively evaluate the quality of this data. Our current approach involves boosting the volume of data in this way by a factor of 1,000. That gives us the data foundation we require for deploying the new DGX SuperPOD.
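As a highly simplified sketch of that general pattern (generate candidates from expert-curated seeds, then let a quality model act as the gatekeeper), the following is purely illustrative; the function names and thresholds are hypothetical, and this is not DeepL’s pipeline:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Example:
    source: str
    target: str
    score: float = 0.0

def expand_with_synthetic_data(
    seed_examples: Iterable[Example],
    generate_candidates: Callable[[Example, int], list[Example]],  # generator model (hypothetical)
    score_quality: Callable[[Example], float],                     # quality-evaluation model (hypothetical)
    expansion_factor: int = 1000,
    quality_threshold: float = 0.8,
) -> list[Example]:
    """Extrapolate from expert-curated seed data, keeping only candidates the quality model accepts."""
    kept: list[Example] = []
    for seed in seed_examples:
        for candidate in generate_candidates(seed, expansion_factor):
            candidate.score = score_quality(candidate)
            if candidate.score >= quality_threshold:
                kept.append(candidate)
    return kept
```

The key design choice in a setup like this is that the quality model, not the generator, decides what reaches training, which is what keeps scale from diluting the human signal.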
We can do this with confidence, because our synthetic data has consistently driven quality improvements over time. This stands in contrast to the scenarios in well-publicized recent studies, where moving from real data to synthetic data eventually results in deteriorating performance. That’s not our experience. DeepL’s models are designed around synthetic data. It’s a fundamental aspect of putting human linguistic instincts, preferences and expertise at the heart of our training.
Planning for scaled capabilities
DGX SuperPOD with DGX GB200 systems unlocks powerful new possibilities for scaling both training and inference for AI. As I’ve tried to show in this post, realizing these possibilities isn’t just a case of investing in the hardware, installing it in a data center and turning it on. Deploying a DGX SuperPOD is an exercise in adjusting to what’s newly possible: in terms of FLOPS, memory bandwidth and inter-GPU communication, and also the size of the models you build, and the data with which you train them.
This process doesn’t end once you’ve optimized your software, models and data to get the most out of the infrastructure. Many of the capabilities you get from a technology like the DGX SuperPOD with DGX GB200 systems are perfectly predictable (like the increased compute power, and the lower latency and higher speed for inference). However, others aren’t. There are new and surprising capabilities that emerge as your model gets bigger. Planning your approach to research so that you can anticipate, discover and leverage these emergent capabilities is another exercise. A fascinating and creative one. I’ll be tackling that in my next post on this blog.
About the Author
A lifelong science and technology enthusiast, Stefan Mesken studied mathematics and computer science at the University of Münster, where he also worked as a researcher before taking on a full-time role as a data scientist to start his career in tech. He started working in AI in 2018, and joined DeepL’s research team in 2020. As Chief Scientist, he shapes the strategy and prioritization of DeepL’s research agenda and coordinates large-scale research initiatives while aligning research goals with engineering and product teams. One of Stefan’s proudest accomplishments is helping lead DeepL’s next-gen model program, a cornerstone in the company’s push toward advanced AI-driven communication. He remains deeply committed to developing world-class tools rooted in cutting-edge research and practical impact.