Google DeepMind Introduces QuestBench to Evaluate Language Models on Complex Reasoning Tasks

Google DeepMind has unveiled QuestBench, a new benchmark designed to assess whether large language models (LLMs) can identify the key question needed to solve complex problems in areas such as math, logic, and planning. The team recently published a paper describing the benchmark, which focuses on reasoning tasks that can only be solved by asking a single, well-targeted clarifying question.
Tackling Underspecified Problems
In typical applications, LLMs are given well-specified tasks, meaning all the information required to solve them is provided. In real-world scenarios, however, problems are often underspecified: users may omit crucial details, or systems such as robots may operate in environments that are only partially observable. In such cases, LLMs must proactively seek out the missing information by asking clarifying questions.
QuestBench is designed to evaluate how effectively LLMs can generate these vital questions. The benchmark focuses on underspecified tasks, where the provided information is not enough to solve the problem. The model's challenge is identifying and asking the right question to gather the missing data required to complete the task.
Formalizing the Underspecification Problem
DeepMind's team formalizes underspecification as a constraint satisfaction problem (CSP): a reasoning task is modeled as a set of variables and constraints, and the goal is to determine the value of a target variable. The problem is underspecified when the target's value cannot be inferred from the information provided. QuestBench also draws a line between semantic ambiguity, where a request admits multiple interpretations, and underspecification, where the goal is clear but additional information is required to reach a solution; the benchmark targets the latter.
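To make this framing concrete, here is a minimal Python sketch, not DeepMind's implementation: the boolean domains, the toy constraints, and the helper names derivable and sufficient_questions are illustrative assumptions. It checks whether a target variable is already entailed by the known facts and constraints and, if not, searches for a single variable whose value, whichever way it turns out, would pin the target down.

```python
# Illustrative sketch of the CSP view of underspecification (assumptions:
# boolean variables, brute-force enumeration, toy constraint encoding).
from itertools import product

def derivable(target, known, constraints):
    """True if `target` takes the same value in every assignment of the
    unknown variables that satisfies all constraints (i.e. it is entailed)."""
    variables = sorted({v for c in constraints for v in c["vars"]} | {target})
    unknown = [v for v in variables if v not in known]
    target_values = set()
    for combo in product([0, 1], repeat=len(unknown)):
        assignment = dict(known, **dict(zip(unknown, combo)))
        if all(c["check"](assignment) for c in constraints):
            target_values.add(assignment[target])
    return len(target_values) == 1

def sufficient_questions(target, known, constraints, candidates):
    """Variables whose value, whatever the answer, would determine the target."""
    return [
        var for var in candidates
        if all(derivable(target, {**known, var: val}, constraints) for val in (0, 1))
    ]

# Toy instance: target z, with constraints x <-> z and y -> z, nothing known.
constraints = [
    {"vars": ("x", "z"), "check": lambda a: a["x"] == a["z"]},
    {"vars": ("y", "z"), "check": lambda a: (not a["y"]) or a["z"]},
]
print(derivable("z", {}, constraints))                         # False: underspecified
print(sufficient_questions("z", {}, constraints, ["x", "y"]))  # ['x']: ask about x
```

The brute-force enumeration is only for readability; it scales exponentially with the number of unknowns, which is one reason the number of variables and constraints is a natural knob for task difficulty.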
The benchmark spans four task categories, each involving reasoning under incomplete information: Logic-Q (logical reasoning), Planning-Q (planning problems with partial observations), GSM-Q (grade school math word problems), and GSME-Q (grade school math problems translated into equations). Each instance is constructed so that asking exactly one well-chosen question makes the problem solvable; an illustration of the GSME-Q style follows below.
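As a hedged illustration of the GSME-Q style (not the dataset's actual format; the basket word problem, the specific numbers, and the use of sympy are assumptions for this sketch), the snippet below encodes a word problem as equations, shows that the target stays symbolic because one quantity was never stated, and shows how the answer to the single missing-value question resolves it.

```python
# Illustrative GSME-Q-style item (assumed example, not from the dataset):
# "A basket has 3 more apples than oranges; each fruit costs 2. What does
#  the whole basket cost?"  The number of oranges is never stated.
import sympy as sp

apples, oranges, total = sp.symbols("apples oranges total")

equations = [
    sp.Eq(apples, oranges + 3),           # 3 more apples than oranges
    sp.Eq(total, 2 * (apples + oranges)), # cost of the whole basket at 2 per fruit
]

# Underspecified: the target `total` can only be expressed in terms of `oranges`.
print(sp.solve(equations, [apples, total], dict=True))
# -> [{apples: oranges + 3, total: 4*oranges + 6}]

# The single right question is "how many oranges are there?"; once answered
# (say, 5), the target becomes a concrete number.
answered = [eq.subs(oranges, 5) for eq in equations]
print(sp.solve(answered, [apples, total], dict=True))
# -> [{apples: 8, total: 26}]
```

GSM-Q keeps items like this in verbalized word-problem form, so the model must also map the language onto the underlying quantities before it can spot which one is missing.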
Performance of Language Models
QuestBench evaluated several state-of-the-art LLMs, including GPT-4, Claude 3.5, and Gemini 2.0, under settings such as zero-shot, chain-of-thought, and four-shot prompting. The results showed that while the models performed well on the simpler GSM-Q and GSME-Q tasks, they struggled in the more complex Logic-Q and Planning-Q domains.
The models did best on instances with fewer variables and constraints, reaching over 80% accuracy on GSM-Q tasks, but their accuracy often stayed below 50% in the harder logic and planning domains. The study also found that performance is sensitive to the search depth of the underlying problem, suggesting the models may rely on different problem-solving strategies for different task types.
QuestBench offers a valuable framework for evaluating how well LLMs handle underspecified problems. The findings should inform the development of LLMs that are more reliable in real-world applications, where missing information often has to be identified and gathered through clarifying questions. To explore the work further, you can access the full research paper, along with the GitHub project and the QuestBench dataset released under the Apache 2.0 license.