NVIDIA Launches Dynamo to Boost AI Inference Performance

NVIDIA has announced the launch of NVIDIA Dynamo, a new open-source inference software designed to optimize the performance and cost-efficiency of AI reasoning models. As AI adoption becomes more mainstream, enhancing inference performance while lowering associated costs is key to maximizing the growth and revenue potential of the AI factories that serve these models.
A Game-Changer for AI Factories
NVIDIA Dynamo is the successor to the renowned NVIDIA Triton Inference Server, which has been widely used in the industry. This new platform is built to orchestrate and accelerate inference communication across vast fleets of GPUs.
It uses disaggregated serving to separate the two main stages of large language model (LLM) processing: prefill, which handles the input prompt, and decode, which generates output tokens. By assigning these phases to different GPUs, Dynamo lets each be optimized for its specific compute profile, improving overall resource utilization and efficiency.
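To make the split concrete, here is a minimal conceptual sketch of disaggregated serving. All names below are illustrative, not Dynamo's actual API: a prefill worker consumes the whole prompt and produces attention key/value state, which a decode worker on a different GPU then reuses to generate tokens one at a time.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Key/value attention state produced by the prefill phase (illustrative)."""
    prompt_tokens: list

def prefill_worker(prompt_tokens: list) -> KVCache:
    # Phase 1 (compute-bound): process the entire prompt in one pass
    # and build the KV cache that decoding will reuse.
    return KVCache(prompt_tokens=prompt_tokens)

def decode_worker(cache: KVCache, max_new_tokens: int) -> list:
    # Phase 2 (memory-bandwidth-bound): generate output tokens one at a
    # time on a separate GPU, reading the transferred KV cache.
    output = []
    for step in range(max_new_tokens):
        output.append(f"token_{step}")  # placeholder for real model sampling
    return output

cache = prefill_worker(["What", "is", "disaggregated", "serving", "?"])
answer = decode_worker(cache, max_new_tokens=3)
print(answer)  # ['token_0', 'token_1', 'token_2']
```

Because the two phases stress hardware differently (prefill is compute-heavy, decode is memory-bandwidth-heavy), running them on separately sized GPU pools is what unlocks the utilization gains described above.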
The core promise of NVIDIA Dynamo is its ability to maximize token revenue generation in AI factories, particularly those deploying complex reasoning AI models. The many tokens such models generate to "think" through each prompt can be processed faster, significantly boosting the financial viability of AI services for providers.
Doubling Performance and Revenue
According to NVIDIA, Dynamo delivers a substantial performance boost when serving Llama models on the NVIDIA Hopper platform: it has been shown to double an AI factory's throughput and revenue generation on the same number of GPUs. On large GPU clusters running models such as DeepSeek-R1, the software has increased the number of tokens generated per GPU by more than 30 times.
NVIDIA Dynamo optimizes inference efficiency by dynamically adjusting GPU resources based on demand, ensuring cost-effective usage. It also offloads data to more affordable storage and memory systems, reducing costs while maintaining performance.
The Power of Disaggregated Serving
NVIDIA Dynamo introduces disaggregated serving, where tasks in large language models are processed across different GPUs, improving throughput and response times.
Leading AI companies such as Cohere and Together AI are integrating Dynamo to enhance their inference capabilities: Cohere is focused on scaling its agentic AI features, while Together AI is optimizing its model pipeline for better resource utilization.
Dynamo incorporates key innovations such as the GPU Planner, Smart Router, Low-Latency Communication Library, and Memory Manager to improve efficiency and reduce costs. As an open-source platform, it supports various AI frameworks and will be available in NVIDIA NIM microservices, with future support in NVIDIA AI Enterprise, offering security and stability for enterprises and researchers.
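As one example of these components, the routing idea behind a smart router can be sketched as follows. The names here are illustrative assumptions, not Dynamo's Smart Router API: a request is sent to the worker whose existing KV caches share the longest prefix with the incoming prompt, so the least prefill work has to be redone.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens: list, workers: dict) -> str:
    """Pick the worker with the best KV-cache overlap for this prompt.

    workers: mapping of worker id -> list of cached token prefixes.
    """
    def best_overlap(prefixes):
        return max((shared_prefix_len(prompt_tokens, p) for p in prefixes),
                   default=0)
    return max(workers, key=lambda w: best_overlap(workers[w]))

workers = {
    "gpu-0": [["You", "are", "a", "helpful", "assistant"]],
    "gpu-1": [["Translate", "to", "French"]],
}
print(route(["You", "are", "a", "pirate"], workers))  # gpu-0
```

Cache-aware routing of this kind is what lets a fleet avoid recomputing shared prompt prefixes, complementing the GPU Planner's capacity decisions and the Memory Manager's tiering.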