MLCommons Unveils New AI Benchmarks to Measure Speed and Efficiency of Cutting-Edge Hardware

AI industry consortium MLCommons has introduced two new benchmarks designed to evaluate how quickly advanced hardware and software can handle artificial intelligence workloads. These additions to its MLPerf benchmark suite aim to assess the performance of systems that power popular AI applications like chatbots and search engines.
As AI models become more complex and widely used, chipmakers have increasingly prioritized building hardware that can efficiently process the massive volumes of data required for real-time responses. MLCommons’ new benchmarks respond to this trend, enabling clearer insights into how top-tier systems perform under heavy AI workloads.
Benchmark Based on Meta's Llama 3.1 Model
One of the newly released tests is based on Meta Platforms’ Llama 3.1 model, which contains 405 billion parameters. This benchmark is intended to evaluate a system’s capabilities in tasks such as general question answering, mathematical problem-solving, and code generation. It also measures how well a system synthesizes information from multiple sources in response to large and complex queries.
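To make the measurement concrete, here is a minimal sketch of how end-to-end latency might be timed over the kinds of prompts this benchmark covers (general Q&A, math, and code generation). It assumes a generic HTTP inference endpoint; the URL, model name, payload fields, and prompts are illustrative assumptions, not MLPerf's official LoadGen harness.

```python
import time
import requests  # assumes a simple HTTP inference endpoint; not MLPerf's official harness

# Illustrative prompts spanning the task types the benchmark covers:
# general question answering, math problem-solving, and code generation.
PROMPTS = [
    "What causes the seasons on Earth?",
    "Solve for x: 3x + 7 = 22.",
    "Write a Python function that reverses a linked list.",
]

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local inference server
MODEL = "llama-3.1-405b"                           # illustrative model identifier

def time_single_query(prompt: str) -> float:
    """Send one prompt and return end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 256},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

if __name__ == "__main__":
    latencies = [time_single_query(p) for p in PROMPTS]
    for prompt, latency in zip(PROMPTS, latencies):
        print(f"{latency:6.2f}s  {prompt[:40]}")
    print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```

In practice, the official benchmark drives many such queries concurrently and reports aggregate throughput and latency percentiles rather than single-query timings.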
Nvidia was among the major players to submit hardware for this benchmark, including its latest Grace Blackwell server platform, which links 72 GPUs. According to Nvidia, even when only eight of those GPUs were used to match the scale of its prior-generation systems, the new platform was 2.8 to 3.4 times faster.
The company credits its enhanced chip interconnects for this leap in speed, an essential feature for AI models that operate across multiple processors simultaneously.
System builders such as Dell Technologies also submitted entries, while rival AMD did not take part in the 405-billion-parameter test, MLCommons data showed.
Second Benchmark Mimics Real-World Consumer AI Use
The second benchmark, also based on an open-source Meta AI model, is designed to reflect the performance requirements of widely used consumer AI applications, such as OpenAI’s ChatGPT. This test focuses on how quickly and consistently a system responds to conversational, user-style queries, mirroring the speed and reliability expected in mainstream use.
By simulating real-world usage patterns, this benchmark aims to help developers and hardware vendors better understand how their systems will perform in practical deployment environments.
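For interactive workloads like these, responsiveness is commonly summarized by time to first token and streaming throughput. The sketch below shows one way such figures might be collected, assuming a streaming-capable, OpenAI-style endpoint; the URL, model name, and request fields are illustrative assumptions and this is not the official MLPerf client.

```python
import time
import requests  # assumes a streaming, OpenAI-style HTTP endpoint; purely illustrative

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local inference server
MODEL = "example-chat-model"                       # illustrative model identifier

def measure_interactive(prompt: str) -> dict:
    """Stream one completion and report time to first token plus rough throughput."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    with requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first streamed chunk arrives
            chunks += 1
    total = time.perf_counter() - start
    first = first_token_at if first_token_at is not None else start
    return {
        "time_to_first_token_s": first - start,
        "total_latency_s": total,
        "chunks_per_second": chunks / total if total > 0 else 0.0,
    }

if __name__ == "__main__":
    print(measure_interactive("Suggest three titles for a blog post about home gardening."))
```

Metrics like these are what distinguish a system that merely completes a query from one that feels responsive in a chat-style application.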